cert-manager v0.8.0 and v0.8.1 send excessive traffic #1948
I received an email from Let's Encrypt an hour ago about this. I'm actually running 0.9.0, but I went through several versions during setup because no certificates reached the ready state (it was an LB issue). I had one certificate for norad.fr that stayed pending (on 0.9.0) with an HTTP challenge for maybe a week, which could have made too many requests. I'm not sure at which step in the logs cert-manager calls Let's Encrypt, but I found: k -n cert-manager logs cert-manager-6554467ddb-nbb6d | grep norad.fr | grep 'propagation check failed'
2987
The issue that prevented challenge completion was that port 80 on norad.fr is served by an HTTP redirect server, while I actually wanted a certificate for www.norad.fr (where port 80 is served from the kube cluster). I don't know if that helps.
Running 0.5.2 (old, I know 😓), but no excessive-traffic issues that I can tell. It's been running for ~200 days and primarily uses DNS validation.
I just received an email from LetsEncrypt about running an earlier version of cert-manager. I'm glad for the email and the careful handling of the issue. Thank you, LetsEncrypt team!
We just got the email at 4am this morning (GMT), and one of our certs expired at 3pm today, so we can't get a re-issue now due to the 503 (perhaps an IP ban?). cert-manager v0.7.2. I restarted the cert-manager pod and it immediately started a renewal loop. Not sure how long it's been doing this, but presumably quite a while, as I believe certs default to renewing 30 days before expiry. Logs are below.
This issue is preventing us from updating the version: #1255
Hello, I updated mine to v0.9.0 today.
I tried to use v0.9.1, but for some reason it issues a "temporary certificate", and for some time this untrusted certificate is what gets served. Can this be avoided (on older versions this didn't happen)? Thank you!
Events for versions > v0.8.1:
Is there any way to skip the "Generated temporary self signed certificate" step?
@RaduRaducuIlie How did you install v0.9.1?
I'm on 0.9.0. Cleaned up some test ingresses I had lying around. And noticed (afterwards) that I had a
@rnkhouse, cert-manager is installed with Istio and I just updated the image tag for cert-manager.
Hello, I want to help with my metrics but I don't know how to count the requests. It would be great if you gave us:
If you post this, I'm pretty sure you'll get a lot of feedback :) (including mine)
I've encountered the second pattern on a fresh install of cert-manager v0.9.1 on k8s v1.15.2.
@munnerz Is anyone from Jetstack planning to engage with this issue? It's somewhat disheartening to see affected users reporting in (some with data about the problem, some asking for help collecting that data) and no one from Jetstack replying yet. There are a few posts mentioning that the most recent version shows a pattern of excessive traffic that may lead to Let's Encrypt having to block that version as well.
@cpu @JoshVanL has been looking through logs to try and find anything suspect - we've also been discussing with others on Slack and gathering info too. Is there any data on the percentage of unique accounts this is affecting? i.e. N% of accounts registered using cert-manager are showing abusive traffic patterns?
@munnerz @JoshVanL Great, glad to hear that this is on your radar. Do you have any advice for @anderspetersson's questions? It also sounds like @aparaschiv was able to reproduce this from a brand new install. Could you collaborate with them to reproduce the problem?
Our log analysis platform is not particularly well suited to answering questions like this about the proportion of UAs that also meet some second-level criterion like request volume. I'll ask internally to see if we can pull this somehow.
From your logs, this looks to be the error you're talking about. It is expected that we'd retry this kind of error, as it's a timeout completing a request with the ACME server - this indicates either a network issue or some other problem accessing the Let's Encrypt API. Looking at the timestamps, it seems like the exponential back-off is being applied correctly (specified here: https://github.com/jetstack/cert-manager/blob/582371a1db8469710437b3900bf533c3b3bdffb6/pkg/controller/util.go#L38):
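For reference, exponential back-off of that shape can be sketched as follows. The base delay and cap below are illustrative assumptions, not necessarily the constants in the linked util.go:

```python
def backoff_schedule(base=5.0, cap=300.0, retries=8):
    """Return the retry delays (in seconds) for an exponential back-off.

    The delay doubles on each failed attempt and is capped. The base/cap
    values are assumptions for illustration; the real constants live in
    cert-manager's pkg/controller/util.go.
    """
    delays = []
    delay = base
    for _ in range(retries):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

# Delays grow geometrically, then flatten at the cap:
print(backoff_schedule())
```

Checking that consecutive retry timestamps in the logs match a schedule like this is a quick way to confirm back-off is working.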
That said, I do notice you have this error at the end:
which indicates that the 'webhook' component has not started correctly, which will also cause issues persisting data (which will cause us to retry and apply exponential back-off). That said, from my understanding, the types of abusive traffic patterns we are looking for are more than one request every 5 minutes, and more in the region of multiple requests per second. We expose ACME client library Prometheus metrics which can be used to identify abusive traffic patterns - from there, a full copy of your logs would be appreciated. The Prometheus metrics we expose also contain the response status code from the ACME server:
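As a sketch of how those metrics could be used, the following tallies request counts per HTTP status from Prometheus exposition text. The metric name and labels are assumptions for illustration; check your install's /metrics endpoint for the exact names:

```python
import re

# Sum ACME client request counts by HTTP status from Prometheus text output.
# The metric name below is an assumption; adjust it to match your install.
def requests_by_status(metrics_text):
    totals = {}
    pattern = re.compile(
        r'^certmanager_http_acme_client_request_count'
        r'\{[^}]*status="(\d+)"[^}]*\}\s+(\d+)'
    )
    for line in metrics_text.splitlines():
        m = pattern.match(line.strip())
        if m:
            status, count = m.group(1), int(m.group(2))
            totals[status] = totals.get(status, 0) + count
    return totals

sample = '''
certmanager_http_acme_client_request_count{method="GET",path="/directory",status="200"} 12
certmanager_http_acme_client_request_count{method="POST",path="/acme/new-order",status="429"} 340
'''
print(requests_by_status(sample))
```

A large count for 4xx statuses (especially 429) is a strong hint that an install is retrying abusively.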
This observation from @ryangrahamnc also seems like a promising avenue for debugging.
I'd be very surprised if the issue I had (logs posted above) was due to two ingresses referencing the same secret, as we have only ever seen this once, on one certificate, and all our resources are formulaic, automated by Pulumi, and never shared. We disable Pulumi autonaming for ingresses and services, so there shouldn't be a way for that to have happened, for us.
@ryangrahamnc we added a check for this a little while ago: https://github.com/jetstack/cert-manager/blob/582371a1db8469710437b3900bf533c3b3bdffb6/pkg/controller/certificates/sync.go#L131-L148 This was first in a release in v0.9.0: #1689. Can you share your log messages? It's odd that you're seeing this... 😬
@jsha I got an email about pre-0.8 cert-manager versions being blocked. However, for some reason the email now has no content in my email system, so I am not sure of the exact wording - was it recalled? We downgraded to cert-manager 0.7 a while back because of this infinite-loop problem, and it solved the problem in our case on both Azure and Google Cloud. It would be unfortunate if we are forced to upgrade to an unstable cert-manager version, since apparently the issue hasn't been fixed. Does this mean we need to find some solution other than cert-manager?
@mikkelfj the content of the email was also shared in a community forum thread if you need to reference a copy.
@mikkelfj could you also share your log messages using 0.9.1 to help us dig into this for you? 😀
I don't recall what version we were running before rolling back to 0.7; it was probably 6-12 months ago. We are currently running 0.7 reliably, and that is all we have logs for. The above discussion suggests that the latest version is still not stable. As for 0.7, it works for us, though now that I think about it, perhaps I did see some suspect log content a while back; at least we get new certs for now. If I set up a 0.9.1, I'll let you know how it goes.
After upgrading from
Unfortunately I don't have the logs anymore, and I won't try to upgrade again (I tried about 10 times) because I'm trying to recover from hitting the rate limit (which resulted in quite critical infrastructure being unavailable). I don't know if it's related, but I'd configured one ingress with two domains and the same secret.
I've opened #2155 to update this ^ 😄
Thanks for the update @munnerz! I also hope that helps. However, I think it's still not quite enough. What I'd really like to see is a system where, even if someone manages to bypass the guardrails you're adding and runs two instances of cert-manager, it doesn't go into pathological traffic mode. So, for instance, you mentioned that you think the problem is due to the two instances overwriting a single Resource. Can you make it so that each instance names its Resources randomly, so there is very little chance multiple instances will be contending over one?
Hello, I am using Rancher v2 and I am curious whether we should upgrade cert-manager using the official repo posted here in the comments, or whether the update will also be available via https://github.com/helm/charts/tree/master/stable/cert-manager?
I found my answer here: https://rancher.com/docs/rancher/v2.x/en/installation/options/upgrading-cert-manager/ :)
Given the way that Kubernetes controllers work, this isn't really possible. These resources are created and named by end users, not just by cert-manager. Some resources (i.e. Orders) are created by cert-manager in response to 'user actions', but there's no reliable way for us to shard processing in the way you describe without potentially ending up in situations where no instance will process the resource.

The very decoupled nature of Kubernetes is designed around the idea that different actors can modify/manipulate resources, which aids extensibility. However, if a user runs two controllers that 'compete' with each other, you've effectively got a situation where one person is turning the heating on whilst someone else is continuously turning it off. Leader election et al. is meant to address this sort of thing, ensuring only one instance runs at a time. When users run multiple instances (and worse, when those instances have mismatched versions), it's effectively like running a concurrency-sensitive application without any locks.

We've now made a change (rolled out in v0.11) to make it harder to actually configure things this way (it was too easy in the past), so I'm keen to see how the results look there. Given that you're seeing approx. 2% of accounts express abusive traffic patterns, I still think these instances are down to misconfigurations/bad deployments (and also issues like the one you describe in #2194). I am confident we can continue to reduce this number with ongoing changes, and I think you'd agree that compared to a few months ago on earlier releases, we've reduced the total % of abusive accounts fairly significantly (previously, I believe a far higher proportion of our users showed abusive patterns).

Happy to set up a call or any other kind of chat to go over it in more depth 😄 I appreciate this isn't the simplest concept, and it's a bit tricky to explain it all here 😅
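To make the 'competing controllers' picture concrete, here is a toy simulation (not cert-manager code; names and numbers are purely illustrative) of two reconcile loops that each want a shared resource in a different state, with and without a leader-election-style gate:

```python
def reconcile_loop(desired_values, rounds, leader_election=False):
    """Simulate controllers that each force a shared resource to their own
    desired value. Without leader election, every write by one controller
    re-triggers a write by the other; with it, only the leader acts."""
    resource = None
    writes = 0
    leader = desired_values[0] if leader_election else None
    for _ in range(rounds):
        for desired in desired_values:
            if leader_election and desired != leader:
                continue  # non-leaders stand by
            if resource != desired:
                resource = desired
                writes += 1
    return writes

# Two mismatched controllers 'fight': writes scale with time...
print(reconcile_loop(["v0.8-order", "v0.9-order"], rounds=100))
# ...whereas with leader election the state converges after one write.
print(reconcile_loop(["v0.8-order", "v0.9-order"], rounds=100, leader_election=True))
```

The write count in the first case grows linearly with the number of reconcile rounds, which is why the resulting traffic is so far out of proportion to a single healthy install.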
It sounds like "resources" is probably the wrong abstraction for cert-manager to store its internal state in. What if cert-manager stored its internal state on disk in its container? I understand cert-manager may want to make a certificate resource available so other components (like Nginx) can consume it, but cert-manager could treat the certificate resource as output-only, treating its on-disk state as authoritative. BTW, I tried to look up "resources" in the Kubernetes documentation but didn't find anything that seemed to match the concept here. Are we talking about Kubernetes Objects?
I think you're probably right that misconfiguration is the cause of this excessive traffic, but it's a very common misconfiguration, and I can see why - it seems like it's easy in Kubernetes to lose track of the fact that you've already got a cert-manager instance deployed. Even if it were a rare misconfiguration, it would be important that cert-manager fail cleanly, sending zero traffic rather than sending thousands of times more traffic than normal. While only 2% of cert-manager instances sent high traffic, at times those instances represented 40% of all Let's Encrypt API requests.
Yes, I think cert-manager has made a ton of great progress in recent versions. I really appreciate your work on this! I want to get to the point where 0% of cert-manager clients are abusive, and I think we can get there, but it will probably take some significant design changes. |
I think the new locking changes will help significantly. I don't think it can ever be reduced to 0%; any software can be abused. What is reasonable is to have 0% of it be accidental, with all that remains being malicious. I think one of the next remaining checks would be to ensure that, if a cluster-wide cert-manager is installed, a namespace-scoped one won't start.
After some more careful consideration on this, I think the v0.11 release will significantly improve this due to the change we made to use the

This should massively help, as it'll prevent the 'fighting' behaviour, meaning that newer releases will operate just fine. The older release is likely to sit and not do much (depending on the version), as it won't be able to observe its own state changes and so won't re-sync the resource.

To further insulate us from issues like this in future, I've also opened #2219, which will go a step further and make the ACME Order details immutable once set on our Order resources. This should, once again, prevent fighting, as these values will no longer be able to 'flip-flop'. In the event that two controllers do start to do this, the apiserver will actually reject changes to these fields, which will cause a 4xx error to be returned to the

The above 2, plus the leader election changes, I believe will resolve this issue altogether.
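The 'immutable once set' idea can be sketched as a validation step of the kind an admission check would perform (field names here are hypothetical, and this is a simplification of what #2219 actually implements):

```python
class ImmutableFieldError(ValueError):
    """Raised when an update tries to change a field that is already set."""

def validate_update(old_spec, new_spec, immutable_fields=("acme_order_url",)):
    # Once a guarded field has a value, any update that changes it is
    # rejected, the way an apiserver would answer with a 4xx status.
    for field in immutable_fields:
        old = old_spec.get(field)
        if old is not None and new_spec.get(field) != old:
            raise ImmutableFieldError(f"{field} is immutable once set")
    return new_spec
```

With a check like this in place, two fighting controllers can no longer flip-flop the value: the second writer simply has its update rejected.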
Yes - agreed.
This is a difficult heuristic to develop, IMO. That said, with the new leader election changes, if a user tries to deploy a cluster-scoped version of cert-manager as well as a namespace-scoped version, they will both have the same leader election namespace set (unless the user explicitly changes it), which will mean they won't 'compete'. I'd be more in favour of supporting 'namespace scoped cert-manager' as a first-class feature, and then having a

Relevant to this are the discussions we've had in the past about switching to controller-runtime, which has better support for running informers against multiple namespaces at once. But this is starting to veer far off the original topic, so I'll not go into too much detail here 😄
@jsha regarding cert-manager state, our interactions with other tools in the ecosystem, etc., I'd be happy to set up a quick call to go over some of these details. I appreciate your suggestions, but I don't think it is fair that, because a number of users have misconfigured older and newer clients, we should significantly re-architect the entire project.
Yes
I am not sure if it's fair to say it's easy to lose track - I think some users do it by accident, similar to how some users may install two copies of the same application on their own computer. Kubernetes is a powerful tool, but it must be used properly (and changes to our leader election config will help users not burn themselves here).
👍 - agreed, and I think we've made some significant changes in v0.11 (and also, v0.12), that are mentioned above. I'm hopeful that this will quash that remaining 2%, and I believe that if you dig into the numbers, you'll observe far fewer users of v0.11 and v0.12 that are showing abusive traffic patterns whilst also running an older version.
I think we can get there too 😄 (although excluding users who are intentionally trying to circumvent the rules/cause problems). That said, I am fairly confident it won't take significant design changes 😅
These both look like really positive changes (though I'll admit to not fully understanding how #2097 works). I'm optimistic these will further reduce the problem.
I think it depends on how serious you consider this class of bugs, and whether you think it's the user's fault when they hit them. I've tended to consider this an issue with the software rather than the user, because every user I've reached out to doesn't realize what's going on - there's no good way for them to notice. A big part of why I consider this class of bugs to be serious is that it's non-linear. Yes, a user can always make a mistake and install two copies of a program; that would typically use twice the resources. But under our current understanding, installing two copies of cert-manager can result in 100,000-1,000,000 times as many requests as installing just one copy (based on an expected "normal" traffic of 10 requests per renewal period, or more generously 10 requests per day). It's not clear to me how big a reorganization it would be to move to internal storage; it may be prohibitive. I'd be curious to hear more. My intuition that it's worthwhile is because, so far, a series of fixes to address specific symptoms haven't succeeded in fully addressing the problem. Usually that means that the problems need a more structural approach.
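To make that non-linearity concrete, here is the back-of-envelope arithmetic behind the multiplier, using rough assumed figures (a healthy client making ~10 requests per 60-day renewal period versus fighting instances retrying about once per second):

```python
normal_per_period = 10    # requests a healthy client makes per renewal period
renewal_period_days = 60  # assumed renewal period
fighting_rps = 1          # assumed retry rate while two instances fight

fighting_per_period = fighting_rps * 86_400 * renewal_period_days
multiplier = fighting_per_period / normal_per_period
print(f"~{multiplier:,.0f}x the normal request volume")
```

Even at one request per second, the fighting case lands squarely in the 100,000-1,000,000x range described above.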
Thanks! I'll send you an email to schedule.
... 'helpful' github automation closed this issue - re-opening it so that we can explicitly close it when we're happy 😄
(to clarify, the changes in #2219, and various others, should definitely help significantly in those cases where users are running multiple instances of cert-manager with leader election not properly enabled, but we should wait for some kind of statistical validation of that first!)
Hi, I'm getting the following error on my GKE cluster. My domain is registered with godaddy.com:
cert-manager/controller/challenges "msg"="propagation check failed" "error"="failed to perform self check GET request 'http://www.abc.com/.well-known/acme-challenge/AZl_evY1PscNKi95EdfFNuYG_Gl75-Hi8we7Efbyy7I': Get http://www.abc/.well-known/acme-challenge/AZl_evY1PscNKi95E3EFNxcdfl75-Hi8we7Efbyy7I: dial tcp: lookup www.abc.in on 10.12.244.10:53: no such host" "dnsName"="www.abc.com" "resource_kind"="Challenge" "resource_name"="abc-3611830638-3386842356-1600275291" "resource_namespace"="default" "type"="http-01"
@kushwahashiv this issue isn't the place for generic cert-manager support - could you join the #cert-manager channel over on https://slack.k8s.io so we can work through getting this running for you?
@munnerz OK, I have joined the Slack channel. I deleted the whole GKE cluster; let me recreate the cluster and its deployments etc., and then I will connect on Slack if the issue still persists. Thanks for your prompt reply. / Shiv
Issues go stale after 90d of inactivity. |
I think we can close this one now. There are currently only 2 instances of cert-manager v0.8.x in our top clients (though there are a smattering of other versions showing up). Thanks for all your work on the issue!
At Let's Encrypt, we've noticed that cert-manager v0.8.0 and v0.8.1 generate excessive traffic under some circumstances. Since we don't have access to the cert-manager installs, we're not sure what those circumstances are. This is a placeholder bug for cert-manager users to provide details of their setup after they've noticed in their logs that cert-manager is sending excessive traffic (more than about 10 requests per day in steady state).
I've noticed two patterns in the logs so far:
Also, I've found that a lot of affected cert-manager users seem to have multiple accounts created, sometimes with multiple independent cert-manager instances running on the same IP (by accident).
If you've noticed this, please list what cert-manager version you are using, plus any details of your Kubernetes setup and how many instances of cert-manager are currently running in your setup.
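If it helps anyone gather those details, here is a rough sketch for estimating per-day ACME-related log volume from controller logs (the "acme" marker and the ISO date prefix are assumptions; adjust them to your cert-manager version's log format):

```python
from collections import Counter

def requests_per_day(log_lines, marker="acme"):
    # Count log lines mentioning the marker, bucketed by date. Assumes each
    # line starts with an ISO date such as 2019-09-05.
    per_day = Counter()
    for line in log_lines:
        if marker in line.lower():
            per_day[line[:10]] += 1
    return dict(per_day)

logs = [
    '2019-09-05 controller/acmeorders "msg"="creating order"',
    '2019-09-05 controller/acmechallenges "msg"="presenting challenge"',
    '2019-09-06 controller/acmeorders "msg"="order is ready"',
]
print(requests_per_day(logs))
```

Feed it the output of `kubectl -n cert-manager logs <pod>`; a steady state far above ~10 relevant lines per day is worth reporting here.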
(This issue is linked to from https://community.letsencrypt.org/t/blocking-old-cert-manager-versions/98753/2, and from an email we'll send shortly about deprecating older cert-manager versions. Note that even though v0.8 still has some issues, it's still definitely better than previous versions.)