Caddy unusable when acme server is down #1680

Closed
jleclanche opened this issue May 19, 2017 · 42 comments
Labels
discussion 💬 The right solution needs to be found

Comments

@jleclanche

1. What version of Caddy are you using (caddy -version)?

0.10.0

2. What are you trying to do?

Start caddy

3. What is your entire Caddyfile?

(any caddyfile with tls enabled)

4. How did you run Caddy (give the full command and describe the execution environment)?

/opt/caddy/caddy --log stdout --agree=true --conf=/opt/caddy/Caddyfile --root=/var/tmp --email=admin@example.com

7. What did you see instead (give full error messages and/or log)?

Activating privacy features...2017/05/19 09:47:19 get directory at 'https://acme-v01.api.letsencrypt.org/directory': failed to get json "https://acme-v01.api.letsencrypt.org/directory": Get https://acme-v01.api.letsencrypt.org/directory: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

8. How can someone who is starting from scratch reproduce the bug as minimally as possible?

With any basic caddyfile that has tls on, caddy will fail if the url https://acme-v01.api.letsencrypt.org/directory fails to give a response before the timeout.

There doesn't appear to be a workaround for this. Caddy should ignore the error if a certificate is already present and valid.

@jleclanche
Author

I worked around this by temporarily specifying -ca https://acme-staging.api.letsencrypt.org/directory, which is currently up and running...

@mholt
Member

mholt commented May 19, 2017

Hi Jerome, thanks for the question. When a CA is down, that's really a problem, because it means Caddy can't obtain credentials it requires to serve your site securely. And serving a site insecurely is a bad idea.

There doesn't appear to be a workaround for this.

There are several, they're just not great. One is to provide your own certificates with the tls directive. Another is to disable HTTPS entirely by specifying your site address with http:// in it. Another is to, as you said, change CAs. However, changing to the staging CA provides an invalid certificate that isn't trusted.

Caddy should ignore the error if a certificate is already present and valid.

I disagree, this is a security and uptime issue that demands your attention.

So, this is not a bug and all is working as intended.

@mholt mholt closed this as completed May 19, 2017
@tobz

tobz commented May 19, 2017

As someone considering using Caddy...

@mholt are you saying that Caddy needs to redownload the private certificate material every time it starts, even if it's within the validity window of a previous issuance? That it grabs and stores them in memory at start-up?

@tscs37

tscs37 commented May 19, 2017

I disagree, this is a security and uptime issue that demands your attention.

I think the underlying problem is not a lack of security, the certificate is already present and valid.

I suggest that if there is a minimum amount of time left on the certificate (for example, at least 21 days), then Caddy can safely continue operating with existing certificates.

The site can be securely served for some time, even without the ACME provider being online, and I see no problem with ignoring the error as long as sufficient time is left on the cert.

Otherwise, the uptime of the backend is directly tied to the uptime of the ACME provider, if they go down, Caddy goes down.

At minimum, the error of an unavailable ACME provider should not be a fatal one if a valid cert with a minimum lifetime of 21 days is present for all sites. There is no security issue in that situation as long as no new certificates are required.
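
The rule proposed in this comment can be sketched as a small check. This is only an illustration of the suggestion, not Caddy's code: the 21-day threshold comes from the comment, while the function names and the mapping of hostnames to expiry times are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold suggested in the comment above; hypothetical, not a Caddy setting.
MIN_REMAINING = timedelta(days=21)


def can_start_without_ca(not_after, now, min_remaining=MIN_REMAINING):
    """Return True if a cached certificate still has at least
    `min_remaining` lifetime left at time `now`."""
    return not_after - now >= min_remaining


def startup_allowed(cert_expiries, now, min_remaining=MIN_REMAINING):
    """Return True only if every site has a cached certificate with
    enough lifetime left. `cert_expiries` maps hostname -> expiry time."""
    return all(
        can_start_without_ca(expiry, now, min_remaining)
        for expiry in cert_expiries.values()
    )
```

Passing `min_remaining=timedelta(0)` gives the stricter "serve as long as the certificate is valid right now" variant.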

@jleclanche
Author

@mholt Caddy behaves properly as long as the acme server was up when started; it doesn't need a permanent connection to the acme server, nor should it if the certs have already been created.

Please understand this makes Caddy impossible to start if the ACME server is down.

@atonse

atonse commented May 19, 2017

@mholt Agreed with the others. If the local cert is still valid, and the OCSP data is also current (I think it's refreshed weekly?), then why shouldn't caddy continue to start and serve up using them?

The CA should only matter at the time of renewal of either OCSP or the cert.

This way, Caddy could start and continue to retry in the background, like it does normally with renewals.

@ghost

ghost commented May 19, 2017

I disagree, this is a security and uptime issue that demands your attention.

What nonsense is this? A CA being down for a few minutes does not invalidate already cached certificates. Those are still perfectly valid for a few more weeks. Use them.

Until this is addressed I seriously can't imagine ever choosing Caddy for anything again. Simply ridiculous.

@devlinzed

This really impacts my perception of Caddy as production-ready software.

@atonse

atonse commented May 19, 2017

@r04r @devlinzed agreed. I'm using it on two production sites now. And I was already bitten because the server wouldn't start because one of the sites couldn't get an LE cert (it was an existing site moved to a new server, DNS hadn't propagated) and all sites (even unrelated ones) went down as a result, since the whole caddy server refused to start.

The tradeoffs that are being made are a red flag for any web server that aims to be production-ready.

@GiorgioG

Back to nginx we go. Thanks!

@luto

luto commented May 19, 2017

Anyone looking for an alternative solution, which can issue LE certificates on the fly: take a look at the openresty project in combination with lua-resty-auto-ssl. We've been using it in production for a couple of months now.

@GiorgioG

Caddy should ignore the error if a certificate is already present and valid.

I disagree, this is a security and uptime issue that demands your attention.

@mholt As long as the local cert is valid, what's the security issue?

@paragonie-scott

I disagree, this is a security and uptime issue that demands your attention.

Can you explain in a little more detail? I'm kind of on the fence here.

If you have a valid (non-expired) certificate, that vhost should be safe to start. If you need to fetch a new one, then that is definitely worth erroring out and refusing to start, until it's resolved.

Otherwise, you risk turning LetsEncrypt into a DoS amplification vector for nearly all Caddy users on the Internet.

@GiorgioG

My database doesn't blow up when I only have 300 MB of disk space left on the data volume, so why should my web server stop working when there are X days left on my local cert because the CA is down? This position is absurd and logically indefensible.

@paragonie-scott

@GiorgioG You keep commenting without giving @mholt a chance to respond, and your latest comment seems increasingly agitated.

Maybe take a break from the computer for a bit? Cooler heads prevail.

@GiorgioG

GiorgioG commented May 19, 2017

Geez I wonder why I might be agitated...production site down because of a non-critical issue. I'll file this under 'Security theater' and switch back to nginx. Thanks... see ya.

@mholt
Member

mholt commented May 19, 2017

I should be finishing my paper for NIPS that's due today but, since this is garnering a LOT of attention from HN... I should clarify some things.

@tobz

are you saying that Caddy needs to redownload the private certificate material every time it starts, even if it's within the validity window of a previous issuance?

No, definitely not. Caddy does not require access to a CA while the certificate is valid and not expiring. If it is expiring soon (30 days), Caddy will attempt to renew so that your site will stay online.

That it grabs and stores them in memory at start-up?

No, they're stored on disk first, then loaded into memory.

If the local cert is still valid, and the OCSP data is also current (I think it's refreshed weekly?), then why shouldn't caddy continue to start and serve up using them?

Because there is an error that will cause your site to go down and while the server is still starting, the operator (you) are there to handle the error. Caddy will not serve a site that it believes will go down while you are there to address the issue.

What if the certificate expires in 20 minutes or 20 seconds instead of 20 days? Where do you draw the line? We draw it at a conservative 30. Wherever we draw it, people are going to get bit when their CA is down.

It's important to note:

  • Caddy doesn't take your site offline because it can't renew a certificate. If you're starting Caddy for the first time, there are no sites to take offline because they were already not online.

If you're "restarting" Caddy by killing the process and restarting it, which does take all your sites down, stop it and use signal USR1 which is a graceful, zero-downtime restart that only applies successful reloads: https://caddyserver.com/docs/cli#usr1

@budgetneon

I'm not a fan of this decision, but if you're down right now, you should be able to copy the cert and key from the .caddy directory and change to the manual setup of "tls /path/to/cert /path/to/key".

@mholt
Member

mholt commented May 19, 2017

The Let's Encrypt OCSP responders were also having trouble. Note that Caddy is the only web server to staple OCSP by default. OCSP stapling errors are not "fatal" to Caddy. Further, Caddy stores OCSP staples to disk to be able to weather downtime like this gracefully. Your OCSP is in better hands with Caddy than, say, a default nginx or Apache configuration. (Caddy checks OCSP every hour and updates it halfway through the validity window.)
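
The "updates it halfway through the validity window" rule can be illustrated with a short sketch. It assumes the validity window is the span between the OCSP response's thisUpdate and nextUpdate times; the function names are hypothetical and this is not Caddy's code.

```python
from datetime import datetime, timedelta, timezone


def staple_needs_refresh(this_update, next_update, now):
    """True once `now` is at or past the midpoint of the staple's
    validity window (assumed here to be thisUpdate..nextUpdate)."""
    midpoint = this_update + (next_update - this_update) / 2
    return now >= midpoint


def staple_is_usable(this_update, next_update, now):
    """A cached staple can still be served while `now` lies inside
    its validity window, even if a refresh attempt has failed."""
    return this_update <= now < next_update
```

With a 7-day staple, a refresh would be attempted from day 3.5 onward, leaving roughly 3.5 days in which a cached staple stays usable if the responder is down.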

@stephenwilliams

What if the certificate expires in 20 minutes or 20 seconds instead of 20 days? Where do you draw the line? We draw it at a conservative 30. Wherever we draw it, people are going to get bit when their CA is down.

You could make it configurable so that people can draw the line for themselves and won't have a reason to complain to you

@Chronial

What if the certificate expires in 20 minutes or 20 seconds instead of 20 days? Where do you draw the line? We draw it at a conservative 30. Wherever we draw it, people are going to get bit when their CA is down.

You draw the line somewhere around what is a sensible downtime to expect from the ACME server. Let's Encrypt being down for 30 days is not something we should expect to happen.

If you're "restarting" Caddy by killing the process and restarting it,

There might be reasons to restart the whole machine, though.

@mholt
Member

mholt commented May 19, 2017

@budgetneon

I'm not a fan of this decision, but if you're down right now, you should be able to copy the cert and key from the .caddy directory and change to the manual setup of "tls /path/to/cert /path/to/key".

No need to copy them, you can specify the paths directly as they are in $CADDYPATH (default ~/.caddy). I agree with you though, this is not a great idea in the long run.

@erdii

erdii commented May 19, 2017

@mholt thank you for reacting so fast <3

@mholt
Member

mholt commented May 19, 2017

@stephenwilliams

You could make it configurable so that people can draw the line for themselves and won't have a reason to complain to you

People will always have a reason to complain. If the window was 7 days instead of 30, it'd just be a different subset of users complaining.

@ghost

ghost commented May 19, 2017

Replace Caddy with NGINX

@mholt
Member

mholt commented May 19, 2017

@Chronial

You draw the line somewhere around what is a sensible downtime to expect from the ACME server. Let's Encrypt being down for 30 days is not something we should expect to happen.

Exactly; which is why it's a conservative window. Note that downtime is not the only problem. We want to be more resistant to blocking attacks whereby packets between your server and the CA are blocked entirely. The attack would have to last a full 30 days to be successful.

@atonse

DNS hadn't propagated

You can use the DNS challenge, no DNS propagation necessary.

@Ajedi32

Ajedi32 commented May 19, 2017

Because there is an error that will cause your site to go down and while the server is still starting, the operator (you) are there to handle the error. Caddy will not serve a site that it believes will go down while you are there to address the issue.

That's a reasonable position, but it does seem like there should be a way to override that in case the error is temporary, occurring for reasons beyond your control (i.e. Let's Encrypt being down), and you really do need to start the server right now (e.g. because your site is currently experiencing downtime).

@paragonie-scott

I know this probably won't get through to @GiorgioG at all, because arguing with an angry person is almost never going to happen, but I want to highlight for the rest of the community that this sort of behavior basically amounts to emotional blackmail.

If the window was 7 days instead of 30, it'd just be a different subset of users complaining.

Yep, totally. But if you let them specify the window, you'd only have the fringes of both subsets complaining. (If you default to 30, you'll probably see even less.)

@atonse

atonse commented May 19, 2017

This thread has gotten very distracted.

Nobody, to clarify, exactly zero people, on this thread, are arguing that Caddy should serve invalid TLS certs, etc. People are arguing that if there is a valid cert, and valid OCSP data, (even if it's valid for 10 seconds), caddy should not refuse to start.

Your question about 20 minutes, 20 seconds, my answer is, yes it should serve the site for 20 minutes, and then stop. By then, who knows, maybe the CA will come back up. The point is, just because the cert may expire some time in the FUTURE, doesn't make it any less valid at the present, even if for another 10 seconds.

If you have valid credentials, serve the site. If the credentials aren't valid, don't serve the site. What exactly is the debate here about 20 minutes, 20 hours, or 20 days?

@jleclanche
Author

Sorry @mholt, I didn't imagine the HN post would garner this much attention when I linked this there, didn't mean to drop a bombshell :/

If you're "restarting" Caddy by killing the process and restarting it, which does take all your sites down, stop it and use signal USR1 which is a graceful, zero-downtime restart that only applies successful reloads: https://caddyserver.com/docs/cli#usr1

Right, I got unlucky. My deployment uses ansible's service on Debian, which is an abstraction for systemctl restart caddy.service. The service file I'm using is the upstream one, which specifies ExecReload=/bin/kill -USR1 $MAINPID. So I do use USR1; however, this came out of a server hardware downgrade, which required a reboot. Therefore, Caddy had to boot from scratch.

Which brings me to the following point:

the operator (you) are there to handle the error

This is simply not true for larger-scale deployments, and if Caddy wants to accommodate those, you cannot rely on that at all. Additionally, there is nothing to "handle"; if the acme server is down, there's just nothing anyone can do short of waiting for it to come back up. Using the ACME staging server was a shot in the dark and I was lucky it worked without serving bad certs :)

@mholt
Member

mholt commented May 19, 2017

I should be finishing my paper for NIPS that is due at 1pm today.

Caddy requires certificates for sites that do not have one, have only an expired one cached locally, or have one cached locally that is expiring soon. Caddy's renewal window is 30 days before expiration.

In 410ece8 I've changed the "fatal" renewal window to 7 days. This gives you about 3 weeks of downtime or blockage before Caddy will refuse to start.

I'm rolling out an emergency release 0.10.3 in a few minutes.
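
The two windows described in this comment can be sketched as a startup decision. The 30-day renewal window and the 7-day "fatal" window are taken from the comment; everything else (names, exact boundary comparisons, treating a missing certificate as `None`) is an assumption for illustration and not Caddy's implementation.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum

RENEW_WINDOW = timedelta(days=30)  # renewal is attempted inside this window
FATAL_WINDOW = timedelta(days=7)   # a failed renewal at startup is fatal inside this window


class StartupAction(Enum):
    SERVE_AND_RETRY = "serve cached certificate, retry renewal later"
    REFUSE = "refuse to start"


def needs_renewal(not_after, now):
    """True if there is no cached certificate (None) or it is expiring soon."""
    return not_after is None or not_after - now <= RENEW_WINDOW


def on_failed_startup_renewal(not_after, now):
    """Decide what happens when a needed renewal fails at startup,
    for example because the CA is unreachable."""
    if not_after is None or not_after - now <= FATAL_WINDOW:
        return StartupAction.REFUSE
    return StartupAction.SERVE_AND_RETRY
```

The gap between the two windows, `RENEW_WINDOW - FATAL_WINDOW`, is 23 days, which matches the "about 3 weeks" of CA downtime or blockage mentioned above.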

@Ajedi32

Ajedi32 commented May 19, 2017

@mholt Thanks so much!

Even among commercial software it's rare for paying customers to be able to directly contact the primary developer of the software at all, let alone have a conversation with them and get a fix released in under 6 hours. This is some pretty amazing turnaround time, especially considering Caddy is completely free. So again, thanks a lot!

@mholt
Member

mholt commented May 19, 2017

You're welcome. I'm sorry for the trouble.

Now I'm going back to my NIPS paper.

@mholt mholt added the discussion 💬 The right solution needs to be found label May 19, 2017
@jleclanche
Author

@enilfodne The theory is that, because Caddy starts attempting to auto-renew certs within 30 days of their expiration, the acme server would have to be down three weeks in order for the situation to arise.

In practice there's always the possibility that caddy was offline/unused all that time, but most of the time it matches up with the expectation.

@mholt
Member

mholt commented May 19, 2017

@enilfodne

if the server is restarted/started with a LE cert, that have less than 7 days left to renewal, the same issue will manifest if the LE infrastructure has issues.

If Caddy was running before, it should have renewed the certificate in the 3 weeks prior. If not, either the blocking attack or downtime is so long it's basically hopeless for the next 7 days anyway OR your site has been down for 3+ weeks already.

This effectively cuts the certificate lifetime with 7 days and puts strain on LE's infrastructure for no arguable increase in certificate safety.

No, because Caddy renews certificates 30 days out when Caddy is running. If it gets to the point where it still needs to renew and it's only 7 days out, then something is seriously, seriously wrong. Either you need a new CA or you're under attack. Both demand your direct attention. And note this only applies to process startup, not continuous running and not USR1 reloads, which are graceful (use them!).

Always using the existing files and periodically running upgrade (in the background, not on startup) to "refresh" the certificates.

This is exactly what Caddy does, in addition to checking on startup. That check at startup is essential.

i expect everyone in this thread is interested in the future of Caddy.

Not everyone, unfortunately. Some people are "switching back to nginx" to try to make a point. 🤷‍♂️

@theonewolf

@mholt focus on NIPS, you've done more than enough here!

@ghost

ghost commented May 19, 2017

@mholt Because servers never reboot, right? Why is this issue still closed?

@mholt
Member

mholt commented May 19, 2017

@r04r Explain. What's the problem?

@sebastianmarkow

sebastianmarkow commented May 19, 2017

I should be finishing my paper for NIPS that's due today

What's the paper about? No offense, never perceived you as a NIPS guy. 👍🏻

@eteeselink

eteeselink commented May 19, 2017

Wow, I'm impressed. @mholt closed an issue, fought off the unreasonable people, discussed with the reasonable people, allowed them to change his mind, pushed an emergency release with a great design that makes everybody happy and does not impact security in any way, and just about finished that NIPS paper - all in the span of a few hours.

This really impacts my perception of Caddy as production-ready software.

@eliasp

eliasp commented May 19, 2017

Just to add my 2¢ here:

  • We've been using Caddy in production for more than a year. It has provided us great service and helped us avoid all the hassle of doing certificate management manually/properly, which avoided costs, saved our sanity and allowed us to rapidly deploy a lot of infrastructure from scratch. In case our 2nd funding round works out as expected, Caddy is my top-priority FLOSS project to receive a donation.
  • Caddy isn't perfect, but which software is? At least I haven't found software before that puts so much focus on "simply works" while still providing a lot of power and flexibility. The bugs/issues we ran into were negligible and far below the baseline of what I expected from such young software.
  • The support through GitHub issues, Gitter discussions, etc. was awesome, always helpful and based on technical, not ideological talking points.
  • Stuff like "I'm back to nginx" helps no one and just creates useless tension in such a discussion.

Now for a technical comment:
In our case, we're using Caddy (amongst other scenarios) as TLS endpoint/load-balancer which does nothing but to serve a lot of different domains and route them through to various backends (most of them fronted by Caddy as well where it handles rewrites etc.).
The configuration is generated through our config management, so it changes relatively often, which also means: new hosts/domains are added on a regular basis.
In this case, the new domains didn't have any pre-existing certificates yet, so Caddy failed completely (yes, I know about the SIGUSR1 stuff, but we also have to consider the reboot scenario and by default our config management also used restart instead of reload) on startup.

What I'd love to see: being able to configure Caddy to start up despite a few hosts failing to acquire their certificate via ACME, so at least those with an existing certificate keep being properly served while the remaining ones just log their issues and fail gracefully, without tearing the whole Caddy process down, e.g.:

tls {
  fail_mode log|error
}
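
The behavior requested here can be sketched roughly as follows. Note that `fail_mode` is the option proposed in this comment, not an existing Caddy directive, and the function name and result format below are hypothetical.

```python
def plan_startup(obtain_results, fail_mode="error"):
    """Decide which hosts to serve after attempting to obtain certificates.

    `obtain_results` maps hostname -> None on success, or an error
    message string on failure. Returns (hosts_to_serve, log_lines).
    In "error" mode any failure aborts startup with RuntimeError;
    in "log" mode failing hosts are skipped and reported instead.
    """
    if fail_mode not in ("log", "error"):
        raise ValueError(f"unknown fail_mode: {fail_mode!r}")

    failed = {host: err for host, err in obtain_results.items() if err is not None}
    if failed and fail_mode == "error":
        raise RuntimeError("could not obtain certificates for: "
                           + ", ".join(sorted(failed)))

    hosts_to_serve = sorted(h for h, err in obtain_results.items() if err is None)
    log_lines = [f"{host}: certificate unavailable, not serving: {err}"
                 for host, err in sorted(failed.items())]
    return hosts_to_serve, log_lines
```

In "log" mode, a host that could not obtain a certificate is left out while the remaining hosts still start; "error" mode corresponds to the all-or-nothing startup described in this thread.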

@mholt Thanks a lot for your work, comments and the release! Good luck with your paper!

@Chronial

@eliasp You should probably open a separate issue for that.

@caddyserver caddyserver locked and limited conversation to collaborators Sep 18, 2017