Caddy unusable when acme server is down #1680
Comments
I worked around this by temporarily specifying |
Hi Jerome, thanks for the question. When a CA is down, that's really a problem, because it means Caddy can't obtain credentials it requires to serve your site securely. And serving a site insecurely is a bad idea.
There are several, they're just not great. One is to provide your own certificates with the
I disagree, this is a security and uptime issue that demands your attention. So, this is not a bug and all is working as intended. |
As someone considering using Caddy... @mholt are you saying that Caddy needs to redownload the private certificate material every time it starts, even if it's within the validity window of a previous issuance? That it grabs and stores them in memory at start-up? |
I think the underlying problem is not a lack of security: the certificate is already present and valid. I suggest that if there is a minimum amount of time left on the certificate (for example, at least 21 days), then Caddy can safely continue operating with existing certificates. The site can be securely served for some time, even without the ACME provider being online, and I see no problem with ignoring the error as long as sufficient time is left on the cert. Otherwise, the uptime of the backend is directly tied to the uptime of the ACME provider: if they go down, Caddy goes down. At minimum, the error of an unavailable ACME provider should not be a fatal one if a valid cert is present for all sites with a minimum lifetime of 21 days. There is no security issue in that situation as long as no new certificates are required. |
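The check proposed in the comment above could look roughly like this. It is only a sketch of the commenter's suggestion, not Caddy's code: the function name `can_start_without_ca`, the mapping of site names to expiry times, and the 21-day default are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold taken from the comment above; not a Caddy setting.
MIN_REMAINING = timedelta(days=21)


def can_start_without_ca(expiries, now, min_remaining=MIN_REMAINING):
    """Return True if every site has a cached cert with enough lifetime left.

    `expiries` maps a site name to its certificate's notAfter time, or to
    None when no certificate is cached for that site.
    """
    for not_after in expiries.values():
        if not_after is None:
            return False  # a new certificate is required, so the CA is needed
        if not_after - now < min_remaining:
            return False  # too close to expiry to ignore the CA outage
    return True


now = datetime(2017, 5, 19, tzinfo=timezone.utc)
sites = {
    "example.com": now + timedelta(days=60),
    "blog.example.com": now + timedelta(days=25),
}
print(can_start_without_ca(sites, now))
```

Under this rule a single site without a cached certificate still blocks startup, which matches the "as long as no new certificates are required" condition in the comment.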
@mholt Caddy behaves properly as long as the ACME server was up when it started; it doesn't need a permanent connection to the ACME server, nor should it if the certs have already been created. Please understand this makes Caddy impossible to start if the ACME server is down. |
@mholt Agreed with the others. If the local cert is still valid, and the OCSP data is also current (I think it's refreshed weekly?), then why shouldn't caddy continue to start and serve up using them? The CA should only matter at the time of renewal of either OCSP or the cert. This way, Caddy could start and continue to retry in the background, like it does normally with renewals. |
What nonsense is this? A CA being down for a few minutes does not invalidate already cached certificates. Those are still perfectly valid for a few more weeks. Use them. Until this is addressed I seriously can't imagine ever choosing Caddy for anything again. Simply ridiculous. |
This really impacts my perception of Caddy as production-ready software. |
@r04r @devlinzed agreed. I'm using it on two production sites now. And I was already bitten because the server wouldn't start because one of the sites couldn't get an LE cert (it was an existing site moved to a new server, DNS hadn't propagated) and all sites (even unrelated ones) went down as a result, since the whole Caddy server refused to start. The tradeoffs that are being made are a red flag for any web server that aims to be production-ready. |
Back to nginx we go. Thanks! |
Anyone looking for an alternative solution, which can issue LE certificates on the fly: take a look at the openresty project in combination with lua-resty-auto-ssl. We've been using it in production for a couple of months now. |
@mholt As long as the local cert is valid, what's the security issue? |
Can you explain in a little more detail? I'm kind of on the fence here. If you have a valid (non-expired) certificate, that vhost should be safe to start. If you need to fetch a new one, then that is definitely worth erroring out and refusing to start, until it's resolved. Otherwise, you risk turning LetsEncrypt into a DoS amplification vector for nearly all Caddy users on the Internet. |
My database doesn't blow up when I only have 300mb of disk space left on the data volume, why should my web server stop working when there are X days left on my local cert because the CA is down? This position is absurd and logically indefensible. |
Geez I wonder why I might be agitated...production site down because of a non-critical issue. I'll file this under 'Security theater' and switch back to nginx. Thanks... see ya. |
I should be finishing my paper for NIPS that's due today but, since this is garnering a LOT of attention from HN... I should clarify some things.
No, definitely not. Caddy does not require access to a CA while the certificate is valid and not expiring. If it is expiring soon (30 days), Caddy will attempt to renew so that your site will stay online.
No, they're stored on disk first, then loaded into memory.
Because there is an error that will cause your site to go down, and while the server is still starting, the operator (you) is there to handle the error. Caddy will not serve a site that it believes will go down while you are there to address the issue. What if the certificate expires in 20 minutes or 20 seconds instead of 20 days? Where do you draw the line? We draw it at a conservative 30. Wherever we draw it, people are going to get bit when their CA is down. It's important to note:
If you're "restarting" Caddy by killing the process and restarting it, which does take all your sites down, stop it and use signal USR1 which is a graceful, zero-downtime restart that only applies successful reloads: https://caddyserver.com/docs/cli#usr1 |
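For anyone scripting deploys, the graceful reload mentioned above amounts to sending SIGUSR1 to the running process instead of killing it. A minimal sketch, assuming a Unix host and that you already know the server's PID; the helper name `graceful_reload` and its `send` parameter are made up for this example.

```python
import os
import signal


def graceful_reload(pid, send=os.kill):
    """Ask a running Caddy process to reload by sending it SIGUSR1.

    `pid` is the process ID of the running server. `send` defaults to
    os.kill and is a parameter only so the call can be exercised without
    signalling a real process.
    """
    send(pid, signal.SIGUSR1)


# Example (not executed here): graceful_reload(12345)
```

How you find the PID (a pidfile, your init system, etc.) depends on your setup and is left out on purpose.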
I'm not a fan of this decision, but if you're down right now, you should be able to copy the cert and key from the .caddy directory and change to the manual setup of "tls /path/to/cert /path/to/key". |
The Let's Encrypt OCSP responders were also having trouble. Note that Caddy is the only web server to staple OCSP by default. OCSP stapling errors are not "fatal" to Caddy. Further, Caddy stores OCSP staples to disk to be able to weather downtime like this gracefully. Your OCSP is in better hands with Caddy than, say, a default nginx or Apache configuration. (Caddy checks OCSP every hour and updates it halfway through the validity window.) |
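To make "updates it halfway through the validity window" concrete, here is a small arithmetic sketch. It assumes the validity window runs from the OCSP response's thisUpdate to its nextUpdate; the function names are illustrative and not taken from Caddy.

```python
from datetime import datetime, timezone


def staple_refresh_time(this_update, next_update):
    """Midpoint of the staple's validity window (thisUpdate..nextUpdate)."""
    return this_update + (next_update - this_update) / 2


def staple_needs_refresh(this_update, next_update, now):
    """True once the current time has reached the midpoint of the window."""
    return now >= staple_refresh_time(this_update, next_update)


this_update = datetime(2017, 5, 15, tzinfo=timezone.utc)
next_update = datetime(2017, 5, 22, tzinfo=timezone.utc)
print(staple_refresh_time(this_update, next_update))
```

With a seven-day window like the one above, a cached staple leaves about three and a half days of slack if the responder is unreachable at refresh time.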
You could make it configurable so that people can draw the line for themselves and won't have a reason to complain to you |
You draw the line somewhere around what is a sensible downtime to expect from the ACME server. Let's Encrypt being down for 30 days is not something we should expect to happen.
There might be reasons to restart the whole machine, though. |
No need to copy them, you can specify the paths directly as they are in $CADDYPATH (default ~/.caddy). I agree with you though, this is not a great idea in the long run. |
@mholt thank you for reacting so fast <3 |
People will always have a reason to complain. If the window was 7 days instead of 30, it'd just be a different subset of users complaining. |
Replace Caddy with NGINX |
Exactly; which is why it's a conservative window. Note that downtime is not the only problem. We want to be more resistant to blocking attacks whereby packets between your server and the CA are blocked entirely. The attack would have to last a full 30 days to be successful.
You can use the DNS challenge, no DNS propagation necessary. |
That's a reasonable position, but it does seem like there should be a way to override that in case the error is temporary, occurring for reasons beyond your control (i.e. Let's Encrypt being down), and you really do need to start the server right now (e.g. because your site is currently experiencing downtime). |
I know this probably won't get through to @GiorgioG at all, because arguing with an angry person is almost never going to happen, but I want to highlight for the rest of the community that this sort of behavior basically amounts to emotional blackmail.
Yep, totally. But if you let them specify the window, you'd only have the fringes of both subsets complaining. (If you default to 30, you'll probably see even less.) |
This thread has gotten very distracted. Nobody, to clarify, exactly zero people, on this thread, are arguing that Caddy should serve invalid TLS certs, etc. People are arguing that if there is a valid cert, and valid OCSP data, (even if it's valid for 10 seconds), caddy should not refuse to start. Your question about 20 minutes, 20 seconds, my answer is, yes it should serve the site for 20 minutes, and then stop. By then, who knows, maybe the CA will come back up. The point is, just because the cert may expire some time in the FUTURE, doesn't make it any less valid at the present, even if for another 10 seconds. If you have valid credentials, serve the site. If the credentials aren't valid, don't serve the site. What exactly is the debate here about 20 minutes, 20 hours, or 20 days? |
Sorry @mholt, I didn't imagine the HN post would garner this much attention when I linked this there, didn't mean to drop a bombshell :/
Right, I got unlucky. My deployment uses ansible's Which brings me to the following point:
This is simply not true for larger-scale deployments, and if Caddy wants to accommodate those you cannot rely on that at all. Additionally, there is nothing to "handle"; if the ACME server is down, there's just nothing anyone can do short of waiting for it to come back up. Using the ACME staging server was a shot in the dark and I was lucky it worked without serving bad certs :) |
I should be finishing my paper for NIPS that is due at 1pm today. Caddy requires certificates for sites that do not have one, have only an expired one cached locally, or have one cached locally that is expiring soon. Caddy's renewal window is 30 days before expiration. In 410ece8 I've changed the "fatal" renewal window to 7 days. This gives you about 3 weeks of downtime or blockage before Caddy will refuse to start. I'm rolling out an emergency release 0.10.3 in a few minutes. |
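The startup behavior described in the comment above (renew inside 30 days, treat failure as fatal only inside 7 days) can be sketched as a per-site decision. This is an illustration built from the comment, not the code in 410ece8: the function name, the return labels, and the exact boundary handling (whether "7 days" is inclusive) are assumptions.

```python
from datetime import datetime, timedelta, timezone

RENEWAL_WINDOW = timedelta(days=30)  # start trying to renew inside this window
FATAL_WINDOW = timedelta(days=7)     # a failed renewal is fatal inside this window


def startup_action(not_after, now, renewal_succeeded):
    """Decide what happens for one site at process startup.

    `not_after` is the cached certificate's expiry, or None if no
    certificate is cached. `renewal_succeeded` says whether contacting
    the CA worked; it only matters when a certificate is needed.
    """
    if not_after is None or not_after <= now:
        # No usable certificate: the CA is required.
        return "serve" if renewal_succeeded else "refuse"
    remaining = not_after - now
    if remaining > RENEWAL_WINDOW:
        return "serve"  # not due for renewal, CA is not contacted
    if renewal_succeeded:
        return "serve"
    if remaining <= FATAL_WINDOW:
        return "refuse"  # expiring too soon to start without a renewal
    return "serve_and_retry"  # keep the old cert and retry in the background


now = datetime(2017, 5, 19, tzinfo=timezone.utc)
print(startup_action(now + timedelta(days=20), now, renewal_succeeded=False))
```

The gap between the two windows is 23 days, which is where the "about 3 weeks" of tolerated CA downtime comes from.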
@mholt Thanks so much! Even among commercial software it's rare for paying customers to be able to directly contact the primary developer of the software at all, let alone have a conversation with them and get a fix released in under 6 hours. This is some pretty amazing turnaround time, especially considering Caddy is completely free. So again, thanks a lot! |
You're welcome. I'm sorry for the trouble. Now I'm going back to my NIPS paper. |
@enilfodne The theory is that, because Caddy starts attempting to auto-renew certs within 30 days of their expiration, the acme server would have to be down three weeks in order for the situation to arise. In practice there's always the possibility that caddy was offline/unused all that time, but most of the time it matches up with the expectation. |
@enilfodne
If Caddy was running before, it should have renewed the certificate in the 3 weeks prior. If not, either the blocking attack or downtime is so long it's basically hopeless for the next 7 days anyway OR your site has been down for 3+ weeks already.
No, because Caddy renews certificates 30 days out when Caddy is running. If it gets to the point where it still needs to renew and it's only 7 days out, then something is seriously, seriously wrong. Either you need a new CA or you're under attack. Both demand your direct attention. And note this only applies to process startup, not continuous running and not USR1 reloads, which are graceful (use them!).
This is exactly what Caddy does, in addition to checking on startup. That check at startup is essential.
Not everyone, unfortunately. Some people are "switching back to nginx" to try to make a point. 🤷♂️ |
@mholt focus on NIPS, you've done more than enough here! |
@mholt Because servers never reboot, right? Why is this issue still closed? |
@r04r Explain. What's the problem? |
What's the paper about? No offense, never perceived you as a NIPS guy. 👍🏻 |
Wow, I'm impressed. @mholt closed an issue, fought off the unreasonable people, discussed with the reasonable people, allowed them to change his mind, pushed an emergency release with a great design that makes everybody happy and does not impact security in any way, and just about finished that NIPS paper - all in the span of a few hours. This really impacts my perception of Caddy as production-ready software. |
Just to add my 2¢ here:
Now for a technical comment: What I'd love to see: being able to configure Caddy in a way to startup despite a few hosts failing to acquire their certificate via ACME, so at least those with an existing certificate keep being properly served while the remaining ones just log their issues and fail gracefully, without tearing the whole Caddy process down, e.g.:
@mholt Thanks a lot for your work, comments and the release! Good luck with your paper! |
@eliasp You should probably open a separate issue for that. |
1. What version of Caddy are you using (caddy -version)?
0.10.0
2. What are you trying to do?
Start caddy
3. What is your entire Caddyfile?
(any caddyfile with tls enabled)
4. How did you run Caddy (give the full command and describe the execution environment)?
/opt/caddy/caddy --log stdout --agree=true --conf=/opt/caddy/Caddyfile --root=/var/tmp --email=admin@example.com
7. What did you see instead (give full error messages and/or log)?
Activating privacy features...2017/05/19 09:47:19 get directory at 'https://acme-v01.api.letsencrypt.org/directory': failed to get json "https://acme-v01.api.letsencrypt.org/directory": Get https://acme-v01.api.letsencrypt.org/directory: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
8. How can someone who is starting from scratch reproduce the bug as minimally as possible?
With any basic Caddyfile that has tls on, Caddy will fail if the URL https://acme-v01.api.letsencrypt.org/directory fails to give a response before the timeout.
There doesn't appear to be a workaround for this. Caddy should ignore the error if a certificate is already present and valid.