
cert-manager v0.8.0 and v0.8.1 send excessive traffic #1948

Closed
jsha opened this issue Aug 1, 2019 · 59 comments · Fixed by #2219
Labels
area/acme: Indicates a PR directly modifies the ACME Issuer code
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@jsha
Contributor

jsha commented Aug 1, 2019

At Let's Encrypt, we've noticed that cert-manager v0.8.0 and v0.8.1 generate excessive traffic under some circumstances. Since we don't have access to the cert-manager installs, we're not sure what those circumstances are. This is a placeholder bug for cert-manager users to provide details of their setup after they've noticed in their logs that cert-manager is sending excessive traffic (more than about 10 requests per day in steady state).

I've noticed two patterns in the logs so far:

  • cert-manager requests the same certificate over and over again.
  • cert-manager attempts to create a new account (or look up an old account) over and over again.

Also, I've found that a lot of affected cert-manager users seem to have multiple accounts created, sometimes with multiple independent cert-manager instances running on the same IP (by accident).

If you've noticed this, please list what cert-manager version you are using, plus any details of your Kubernetes setup and how many instances of cert-manager are currently running in your setup.

(This issue is linked to from https://community.letsencrypt.org/t/blocking-old-cert-manager-versions/98753/2, and from an email we'll send shortly about deprecating older cert-manager versions. Note that even though v0.8 still has some issues, it's definitely better than previous versions.)
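(For anyone gathering the details requested above, a rough client-go sketch along these lines can answer the "how many cert-manager instances are running, and which versions?" question. It is illustrative only: the app=cert-manager label selector is an assumption based on common Helm chart defaults, and the List call assumes a recent client-go.)

```go
// Rough sketch: list every pod that looks like a cert-manager controller and
// print its image, to spot duplicate installs. The app=cert-manager label is
// an assumption (Helm chart default); adjust for your deployment.
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Requires a recent client-go (context-aware List signature).
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		LabelSelector: "app=cert-manager",
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("found %d cert-manager pod(s)\n", len(pods.Items))
	for _, p := range pods.Items {
		for _, c := range p.Spec.Containers {
			fmt.Printf("  %s/%s: %s\n", p.Namespace, p.Name, c.Image)
		}
	}
}
```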

@n0rad

n0rad commented Aug 10, 2019

I received an email from Let's Encrypt an hour ago about this.

I'm actually running 0.9.0, but I went through different versions during setup because no certificates reached the ready state (it was an LB issue).

I had one certificate for norad.fr that stayed pending (on 0.9.0) with an HTTP challenge for maybe a week, which could have generated too many requests.

I'm not sure at which step in the logs cert-manager is calling Let's Encrypt, but I found:

k -n cert-manager logs cert-manager-6554467ddb-nbb6d | grep norad.fr | grep 'propagation check failed' | wc -l
2987

The issue that prevented challenge completion was that port 80 on norad.fr is served by an HTTP redirect server, while I actually wanted a certificate for www.norad.fr (where port 80 is served by the kube cluster).

I don't know if that helps.

@evan-eb

evan-eb commented Aug 13, 2019

Running 0.5.2 (old, I know 😓) but no issues with excessive traffic as far as I can tell. It's been running for ~200 days and primarily uses DNS validation.

@kfox1111

I just received an email from LetsEncrypt about running an earlier version of cert-manager. I'm glad for the email and the careful handling of the issue. Thank you, LetsEncrypt team!

@oliverholliday

We got the email at 4am this morning (GMT) and one of our certs expired at 3pm today, so we can't get a re-issue now due to the 503 (perhaps an IP ban?).

Cert-manager v0.7.2.

I restarted the cert-manager pod and it immediately started a renewal loop. I'm not sure how long it's been doing this, but presumably quite a while, as I believe certificates default to renewing 30 days before expiry?

Logs are below.

I0813 16:14:35.179589       1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:14:35.182585       1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:14:35.187403       1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:14:35.192966       1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:14:45.193168       1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:14:45.194316       1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:14:45.198492       1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:14:45.198572       1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:14:55.198764       1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:14:55.199461       1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:14:55.204099       1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:14:55.204262       1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:15:05.204370       1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:15:05.205528       1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:15:05.238805       1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:15:05.238841       1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:15:15.238973       1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:15:15.239829       1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:15:15.242087       1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:15:15.242114       1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:15:25.242310       1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:15:25.243184       1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:15:25.248178       1 sync.go:176] propagation check failed: wrong status code '503', expected '200'
I0813 16:15:25.248667       1 controller.go:212] challenges controller: Finished processing work item "gf2/tls-secret-reader-cryptography-629768480-0"
I0813 16:15:35.248809       1 controller.go:206] challenges controller: syncing item 'gf2/tls-secret-reader-cryptography-629768480-0'
I0813 16:15:35.249092       1 ingress.go:49] Looking up Ingresses for selector certmanager.k8s.io/acme-http-domain=2657227112,certmanager.k8s.io/acme-http-token=692587981
I0813 16:15:35.252109       1 sync.go:176] propagation check failed: wrong status code '503', expected '200'

@rnkhouse

This issue is preventing us from updating the version: #1255

@lucasces

Hello, I updated mine today to v0.9.0.
While checking that everything was in place, I noticed that cert-manager has no back-off mechanism to deal with misconfigured certificates. In my case it kept trying to verify a domain that is not mine every minute, from 17:50 until 00:40, when I deleted the misbehaving certificate. If the logs are accurate, it called GetOrder, GetAuthorization and HTTP01ChallengeResponse 1641 times during that period.
My log for that period:

@RaduRaducuIlie

RaduRaducuIlie commented Aug 14, 2019

I tried to use v0.9.1, but for some reason it issues a temporary certificate, and for some time this untrusted certificate is what gets served. Can this be avoided (with older versions this didn't happen)? Thank you!
Events for v0.6.2:

Events:
  Type    Reason         Age  From          Message
  Normal  Generated      2m   cert-manager  Generated new private key
  Normal  OrderCreated   2m   cert-manager  Created Order resource "order-name-2441368539"
  Normal  OrderComplete  1m   cert-manager  Order "order-name-2441368539" completed successfully
  Normal  CertIssued     1m   cert-manager  Certificate issued successfully

Events for versions > v0.8.1:

  Normal  Generated           16s  cert-manager  Generated new private key
  Normal  GenerateSelfSigned  16s  cert-manager  Generated temporary self signed certificate
  Normal  OrderCreated        16s  cert-manager  Created Order resource "order-name-2441368539"
  Normal  OrderComplete       13s  cert-manager  Order "order-name-2441368539" completed successfully
  Normal  CertIssued          13s  cert-manager  Certificate issued successfully

Is there any way to exclude the "Generated temporary self signed certificate" step?
Thank you!

@rnkhouse

rnkhouse commented Aug 14, 2019

@RaduRaducuIlie How did you install v0.9.1?
It's giving me this error on an AKS cluster: Error: failed to download "stable/cert-manager" (hint: running helm repo update may help).

@ryangrahamnc

I'm on 0.9.0. I cleaned up some test ingresses I had lying around, and noticed (afterwards) that I had a CertIssued event run 1662301 times in the past 4 days. I suspect it's because I had two ingresses fighting over the same TLS secret, but I'm not completely certain, as I didn't check the event count until after I deleted them. The events themselves had the message "Certificate issued successfully".

@RaduRaducuIlie

@rnkhouse, cert-manager is installed with Istio and I just updated the image tag for cert-manager.

@AndresPineros

Hello,

I want to help with my metrics but I don't know how to count the requests. It would be great if you gave us:

  1. A simple script to run on top of our logs that counts the number of requests.
  2. An average value or formula to know if we're actually generating excessive traffic.
  3. A format to publish whatever you need to debug this if we actually find we have excessive traffic.

If you post this, I'm pretty sure you'll get a lot of feedback :) (including mine)

@aparaschiv

I've encountered the second pattern on a fresh install of cert-manager v0.9.1 on k8s v1.15.2
cert-manager.log

@cpu
Contributor

cpu commented Aug 22, 2019

@munnerz Is anyone from Jetstack planning to engage with this issue? It's somewhat disheartening to see affected users reporting in (some with data about the problem, some asking for help collecting that data) while no one from Jetstack has replied yet. There are a few posts mentioning that the most recent version shows a pattern of excessive traffic that may lead to Let's Encrypt having to block that version as well.

@munnerz
Member

munnerz commented Aug 22, 2019

@cpu @JoshVanL has been looking through logs to try and find anything suspect - we've also been discussing with others on Slack and gathering info too.

Is there any data on the percentage of unique accounts this is affecting? i.e. N% of accounts registered using cert-manager are showing abusive traffic patterns?

@cpu
Contributor

cpu commented Aug 22, 2019

@munnerz @JoshVanL Great, glad to hear that this is on your radar. Do you have any advice for @anderspetersson's questions?

It also sounds like @aparaschiv was able to reproduce this from a brand new install. Could you collaborate with them to reproduce the problem?

Is there any data on the percentage of unique accounts this is affecting? i.e. N% of accounts registered using cert-manager are showing abusive traffic patterns?

Our log analysis platform is not particularly well suited to answering questions like this about a proportion of UAs that meet some other 2nd level criteria like request volume. I'll ask internally to see if we can pull this somehow.

@munnerz
Member

munnerz commented Aug 22, 2019

@aparaschiv

E0815 21:03:22.296093 1 base_controller.go:189] cert-manager/controller/clusterissuers "msg"="re-queuing item due to error processing" "error"="Timeout: request did not complete within requested timeout 30s" "key"="letsencrypt-prod"

From your logs, this looks to be the error you're talking about. It is expected that we'd retry this kind of error, as it's a timeout completing a request with the ACME server - this indicates either a network issue or some other problem accessing the Let's Encrypt API.

Looking at the timestamps, it seems like the exponential back-off is being applied correctly (specified here: https://github.com/jetstack/cert-manager/blob/582371a1db8469710437b3900bf533c3b3bdffb6/pkg/controller/util.go#L38):

E0815 21:03:22.296093
E0815 21:03:52.994624
E0815 21:04:23.710262
E0815 21:04:54.452585
E0815 21:05:34.482711
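(For readers less familiar with controller internals: the re-queue pattern described here is typically built on client-go's rate-limited workqueue. The sketch below is illustrative only; the base and max delays are made up, not cert-manager's actual values.)

```go
// Minimal illustration of the exponential back-off applied when a sync fails:
// each consecutive failure for the same key doubles the retry delay.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Base and max delays are illustrative, not cert-manager's real settings.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 5*time.Minute)

	key := "letsencrypt-prod"
	for failure := 1; failure <= 5; failure++ {
		// When() returns the delay before the next retry and records the failure.
		fmt.Printf("failure %d: retry in %s\n", failure, limiter.When(key))
	}

	// In a real controller this limiter backs a workqueue: a failed sync calls
	// queue.AddRateLimited(key); a successful sync calls queue.Forget(key),
	// which resets the per-key failure count (as Forget does here).
	limiter.Forget(key)
}
```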

That said, I do notice you have this error at the end:

E0815 21:08:15.247152       1 base_controller.go:189] cert-manager/controller/clusterissuers "msg"="re-queuing item  due to error processing" "error"="Internal error occurred: failed calling webhook \"clusterissuers.admission.certmanager.k8s.io\": the server is currently unable to handle the request" "key"="letsencrypt-prod" 

which indicates that the 'webhook' component has not started correctly; that will also cause issues persisting data (and so trigger retries with exponential back-off).

That said, from my understanding the abusive traffic patterns we are looking for are well beyond one request every 5 minutes - more in the region of multiple requests per second.

@AndresPineros

We expose ACME client library Prometheus metrics which can be used to identify abusive traffic patterns - from there, a full copy of your logs would be appreciated. The Prometheus metrics we expose also contain the response status code from the ACME server:

certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="GET",path="/directory",scheme="https",status="200"} 1
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="GET",path="/directory",scheme="https",status="999"} 1
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="HEAD",path="/acme/new-nonce",scheme="https",status="200"} 3
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/authz",scheme="https",status="200"} 6
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/cert",scheme="https",status="200"} 2
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/challenge",scheme="https",status="200"} 4
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/finalize",scheme="https",status="200"} 2
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/new-acct",scheme="https",status="200"} 1
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/new-order",scheme="https",status="201"} 2
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/new-order",scheme="https",status="400"} 2
certmanager_http_acme_client_request_count{host="acme-v02.api.letsencrypt.org",method="POST",path="/acme/order",scheme="https",status="200"} 6
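(One rough way to act on this is to scrape the metrics endpoint and sum the counters above. The sketch below assumes the controller's metrics port is reachable locally, e.g. via kubectl port-forward, and that the metric is exposed as a counter; the 9402 port is an assumption, so check your own deployment.)

```go
// Sketch: sum certmanager_http_acme_client_request_count across all label sets
// to get a total ACME request count since the controller started.
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Assumes something like
	//   kubectl -n cert-manager port-forward deploy/cert-manager 9402
	// is already running; 9402 is an assumed default metrics port.
	resp, err := http.Get("http://localhost:9402/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	total := 0.0
	if mf, ok := families["certmanager_http_acme_client_request_count"]; ok {
		for _, m := range mf.GetMetric() {
			// GetCounter() returns nil for non-counter metrics, and GetValue()
			// on a nil counter is 0, so this is safe either way.
			total += m.GetCounter().GetValue()
		}
	}
	fmt.Printf("ACME requests since controller start: %.0f\n", total)
}
```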

@cpu
Contributor

cpu commented Aug 22, 2019

I had a CertIssued event run 1662301 times in the past 4 days. I suspect it's because I had two ingresses fighting over the same TLS secret, but I'm not completely certain, as I didn't check the event count until after I deleted them.

This observation from @ryangrahamnc also seems like a promising avenue for debugging.

@oliverholliday

I'd be very surprised if the issue I had (logs posted above) was due to two ingresses referencing the same secret, as we have only ever seen this once, on one certificate, and all our resources are formulaic, automated by Pulumi and never shared. We disable Pulumi autonaming for ingresses and services, so there shouldn't be a way for that to have happened for us.

@munnerz
Member

munnerz commented Aug 22, 2019

@ryangrahamnc we added a check for this a little while ago: https://github.com/jetstack/cert-manager/blob/582371a1db8469710437b3900bf533c3b3bdffb6/pkg/controller/certificates/sync.go#L131-L148. This first shipped in v0.9.0: #1689

Can you share your log messages? It's odd that you're seeing this... 😬
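(For context on the check linked above: the misconfiguration it guards against is several Certificate resources pointing at one Secret. The sketch below is a standalone illustration of that idea, not cert-manager's actual code.)

```go
// Minimal illustration of the duplicate-secret misconfiguration: if two
// Certificate resources point at the same Secret, the controllers managing
// them can fight over its contents.
package main

import "fmt"

type certificate struct {
	Name       string
	SecretName string
}

// findDuplicateSecretNames returns secret names referenced by more than one
// Certificate, which is the situation the linked check guards against.
func findDuplicateSecretNames(certs []certificate) map[string][]string {
	bySecret := map[string][]string{}
	for _, c := range certs {
		bySecret[c.SecretName] = append(bySecret[c.SecretName], c.Name)
	}
	dupes := map[string][]string{}
	for secret, owners := range bySecret {
		if len(owners) > 1 {
			dupes[secret] = owners
		}
	}
	return dupes
}

func main() {
	certs := []certificate{
		{Name: "cert-a", SecretName: "tls-shared"},
		{Name: "cert-b", SecretName: "tls-shared"}, // both reference the same Secret
		{Name: "cert-c", SecretName: "tls-unique"},
	}
	for secret, owners := range findDuplicateSecretNames(certs) {
		fmt.Printf("secret %q is referenced by multiple certificates: %v\n", secret, owners)
	}
}
```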

@mikkelfj

@jsha I got an email about pre-0.8 cert-manager versions being blocked. However, for some reason the email now shows no content in my email client, so I am not sure of the exact wording - was it recalled?

We downgraded to cert-manager 0.7 a while back because of this infinite loop problem and it solved the problem in our case on both Azure and Google cloud. It would be unfortunate if we are forced to upgrade to an unstable cert-manager version since apparently the issue hasn't been fixed.

Does this mean we need to find some solution other than cert-manager?

@cpu
Contributor

cpu commented Aug 26, 2019

@mikkelfj the content of the email was also shared in a community forum thread if you need to reference a copy.

@munnerz
Member

munnerz commented Aug 28, 2019

@mikkelfj could you also share your log messages using 0.9.1 to help us dig into this for you? 😀

@mikkelfj

I don't recall what version we were running before rolling back to 0.7; it was probably 6-12 months ago. We are currently running 0.7 reliably, and that is all we have logs for. The discussion above suggests that the latest version is still not stable. As for 0.7, it works for us, though now that I think about it, perhaps I did see some suspect log content a while back; at least we get new certs for now.

If I set up a 0.9.1, I'll let you know how it goes.

@cromefire

After upgrading from 0.8.1 to 0.9.1 my logs mentioned something about

Operation cannot be fulfilled on certificates.certmanager.k8s.io "some-cert": StorageError: invalid object, Code: 4

Unfortunately I don't have the logs anymore and won't try to upgrade again (I tried it about 10 times), because I'm trying to recover from hitting the rate limit (which resulted in quite critical infrastructure being unavailable).

I don't know if it's related, but I'd configured one ingress with two domains and the same secret.

@munnerz
Member

munnerz commented Oct 4, 2019

I've opened #2155 to update this ^ 😄

@jsha
Contributor Author

jsha commented Oct 4, 2019

Thanks for the update @munnerz ! I also hope that helps. However, I think it's still not quite enough. What I'd really like to see is a system where, even if someone manages to bypass the guardrails you're adding and run two instances of cert-manager, it doesn't go into pathological traffic mode.

So, for instance, you mentioned that you think the problem is due to the two instances overwriting a single Resource. Can you make it so that each instance names its Resources randomly so that there is very little chance multiple instances will be contending over one?

@anebi

anebi commented Oct 9, 2019

Hello,

I am using Rancher v2 and I am curious whether we should upgrade cert-manager using the official repo posted here in the comments, or whether the update will also be available via https://github.com/helm/charts/tree/master/stable/cert-manager?

@anebi

anebi commented Oct 10, 2019

@munnerz
Member

munnerz commented Oct 11, 2019

So, for instance, you mentioned that you think the problem is due to the two instances overwriting a single Resource. Can you make it so that each instance names its Resources randomly so that there is very little chance multiple instances will be contending over one?

Given the way that Kubernetes controllers work, this isn't really possible. These resources are created and named by end-users, not just by cert-manager. Some resources (i.e. Orders) are created by cert-manager in response to 'user actions', but there's no reliable way for us to shard processing in the way you describe without potentially ending up in situations where no instance will process the resource.

The very decoupled nature of Kubernetes is designed around the idea that different actors can modify/manipulate resources, which aids extensibility. However, if a user runs two controllers that 'compete' with each other, you've effectively got a situation where one person is turning the heating on whilst someone else is continuously turning it off.

Leader election et al is meant to address this sort of thing to ensure only one instance runs at a time. When users run multiple instances (and worse, when these instances have mismatched versions), it's effectively like running a concurrency sensitive application without any locks.
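(For readers unfamiliar with the mechanism: leader election in client-go looks roughly like the sketch below. It is illustrative only - the lock type, namespace, name and timings are assumptions rather than cert-manager's exact configuration - but it shows why two installs sharing the same lock cannot run their controllers at the same time.)

```go
// Minimal leader-election sketch: every replica competes for one Lease lock,
// and only the current leader runs the controllers.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	hostname, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		// Both installs must share this namespace/name for the lock to protect them.
		LeaseMeta:  metav1.ObjectMeta{Namespace: "kube-system", Name: "cert-manager-controller"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: hostname},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 60 * time.Second, // timings are illustrative
		RenewDeadline: 40 * time.Second,
		RetryPeriod:   15 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader: starting controllers")
				<-ctx.Done() // run until leadership is lost or the process exits
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership: stopping controllers")
			},
		},
	})
}
```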

We've now made the change (and it's rolled out to v0.11) to make it harder to actually configure things in this way (it was too easy in the past), so I'm keen to see how the results look there.

Given that you're seeing approx. 2% of accounts expressing abusive traffic patterns, I still think these instances are down to misconfigurations/bad deployments (and also issues like the one you describe in #2194). I am confident we can continue to reduce this number with ongoing changes, and I think you'd agree that, compared to a few months ago on earlier releases, we've managed to reduce the total percentage of abusive accounts fairly significantly (previously, I believe we had a far higher proportion of users with abusive patterns).

Happy to set up a call or any other kind of chat to go over it in a bit more depth 😄 I appreciate this isn't the simplest concept, and it's a bit tricky to explain it all here 😅

@jsha
Contributor Author

jsha commented Oct 11, 2019

These resources are created and named by end-users, not just by cert-manager. Some resources (i.e. Orders) are created by cert-manager in response to 'user actions', but there's no reliable way for us to shard processing in the way you describe without potentially ending out in situations where no instances will process the resource.

It sounds like "resources" is probably the wrong abstraction for cert-manager to store its internal state in. What if cert-manager stored its internal state on disk in its container? I understand cert-manager may want to make a certificate resource available so other components (like Nginx) can consume it, but cert-manager could treat the certificate resource as output-only, treating its on-disk state as authoritative.

BTW, I tried to look up "resources" in the Kubernetes documentation but didn't find something that seemed to match the concept here. Are we talking about Kubernetes Objects?

Given that you're seeing approx. 2% of accounts express abusive traffic patterns, I still think that these instances are down to misconfigurations/bad deployments

I think you're probably right that misconfiguration is the cause of this excessive traffic, but it's a very common misconfiguration, and I can see why - it seems like it's easy in Kubernetes to lose track of the fact that you've already got a cert-manager instance deployed. Even if it were a rare misconfiguration, it would be important that cert-manager fail cleanly, sending zero traffic rather than sending thousands of times more traffic than normal. While only 2% of cert-manager instances sent high traffic, at times those instances represented 40% of all Let's Encrypt API requests.

I think you'd agree compared to a few months ago on earlier releases, we've managed to reduce the total % of abusive accounts fairly significantly (previously, I do believe we had a far higher proportion of our users with abusive patterns).

Yes, I think cert-manager has made a ton of great progress in recent versions. I really appreciate your work on this! I want to get to the point where 0% of cert-manager clients are abusive, and I think we can get there, but it will probably take some significant design changes.

@kfox1111

I think the new locking changes will help significantly.

I don't think it can ever be reduced to 0%. Any software can be abused. What is reasonable is to get accidental abuse to 0%, so that whatever remains is malicious.

I think one of the next remaining checks would be to ensure that if a cluster-wide cert-manager is installed, a namespace-only one won't start.

@munnerz
Member

munnerz commented Oct 15, 2019

After some more careful consideration, I think the v0.11 release will significantly improve this due to the change we made to use the status subresource on our CRDs (#2097). This change means that any old version of cert-manager, when attempting to persist its state, will not be able to clobber the status data written by a newer version, and thus will not interfere with newer versions of cert-manager still running.

This should massively help, as it'll prevent the 'fighting' behaviour, meaning that newer releases will operate just fine. The older release is likely to sit and not do much (depending on the version), as it won't be able to observe its own state changes and so, won't re-sync the resource.

To further insulate us from issues like this in future, I've also opened #2219 which will go a step further and make the ACME Order details immutable once set on our Order resources. This should, once again, prevent fighting as these values will no longer be able to 'flip-flop'. In the event that two controllers do start to do this, the apiserver will actually reject changes to these fields, which will cause a 4xx error to be returned to the UpdateStatus call, which in turn will trigger exponential back-off (and avoid querying ACME in a tight loop!)
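(A simplified illustration of the #2219 idea, not its actual implementation: once an immutable field has been set, any update that changes it is rejected, and in the real apiserver that rejection is the 4xx that triggers the back-off described above. The field name below is a stand-in, not cert-manager's real schema.)

```go
// Sketch of update validation for an "immutable once set" field.
package main

import (
	"errors"
	"fmt"
)

// orderStatus stands in for the relevant part of an Order resource's status.
type orderStatus struct {
	// URL of the ACME order; once populated it should never change.
	OrderURL string
}

// validateUpdate rejects updates that modify an already-set immutable field.
// In a real apiserver this rejection surfaces as a 4xx response, which the
// losing controller handles with exponential back-off instead of re-querying
// the ACME server in a tight loop.
func validateUpdate(old, updated orderStatus) error {
	if old.OrderURL != "" && old.OrderURL != updated.OrderURL {
		return errors.New("status.orderURL is immutable once set")
	}
	return nil
}

func main() {
	stored := orderStatus{OrderURL: "https://acme-v02.api.letsencrypt.org/acme/order/123/456"}
	incoming := orderStatus{OrderURL: "https://acme-v02.api.letsencrypt.org/acme/order/123/789"}

	if err := validateUpdate(stored, incoming); err != nil {
		fmt.Println("update rejected:", err)
	}
}
```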

The above two changes, plus the leader election changes, will I believe resolve this issue altogether.

I don't think it can ever be reduced to 0%. Any software can be abused. What is reasonable is to get accidental abuse to 0%, so that whatever remains is malicious.

Yes - agreed.

I think one of the next remaining checks would be to ensure that if a cluster-wide cert-manager is installed, a namespace-only one won't start.

This is a difficult heuristic to develop IMO. That said, with the new leader election changes, if a user tries to deploy a cluster scoped version of cert-manager as well as a namespace scoped version, they will both have the same leader election namespace set (unless the user explicitly changes it), which will mean they won't 'compete'.

I'd be more in favour of supporting 'namespace scoped cert-manager' as a first-class feature, and then having a --set namespaceToWatch=abc (naming is hard) which would set the leader election namespace as well as disabling any non-namespaced controllers.

Relevant to this is the discussions we've had in the past about switching to controller-runtime, which has better support for running informers against multiple namespaces at once. But this is starting to veer far off the original topic, so I'll not go into too much detail here 😄

@munnerz
Member

munnerz commented Oct 15, 2019

@jsha regarding cert-manager state, our interactions with other tools in the ecosystem, etc., I'd be happy to set up a quick call to go over some of these details. I appreciate your suggestions, but I don't think it is fair that, because a number of users have misconfigured older and newer clients, we should significantly re-architect the entire project.

BTW, I tried to look up "resources" in the Kubernetes documentation but didn't find something that seemed to match the concept here. Are we talking about Kubernetes Objects?

Yes

I think you're probably right that misconfiguration is the cause of this excessive traffic, but it's a very common misconfiguration, and I can see why - it seems like it's easy in Kubernetes to lose track of the fact that you've already got a cert-manager instance deployed.

I am not sure if it's fair to say it's easy to lose track - I think some users do it by accident, similar to how some users may install two copies of the same application on their own computer. Kubernetes is a powerful tool, but it must be used properly (and changes to our leader election config will help users to not burn themselves here).

Even if it were a rare misconfiguration, it would be important that cert-manager fail cleanly, sending zero traffic rather than sending thousands of times more traffic than normal.

👍 - agreed, and I think we've made some significant changes in v0.11 (and also, v0.12), that are mentioned above. I'm hopeful that this will quash that remaining 2%, and I believe that if you dig into the numbers, you'll observe far fewer users of v0.11 and v0.12 that are showing abusive traffic patterns whilst also running an older version.

I want to get to the point where 0% of cert-manager clients are abusive, and I think we can get there, but it will probably take some significant design changes.

I think we can get here too 😄 (although excluding users who intentionally are trying to circumvent the rules/cause problems). That said, I am fairly confident it won't take significant design changes 😅

@jsha
Contributor Author

jsha commented Oct 16, 2019

the v0.11 release will significantly improve this due to the change we made to use the status subresource on our CRDs (#2097).
I've also opened #2219 which will go a step further and make the ACME Order details immutable once set on our Order resources.

These both look like really positive changes (though I'll admit to not fully understanding how #2097 works). I'm optimistic these will further reduce the problem.

due to a number of users that have misconfigured older and newer clients, we should significantly re-architect the entire project.

I think it depends on how serious you consider this class of bugs, and whether you think it's the user's fault when they hit them. I've tended to consider this an issue with the software rather than the user, because every user I've reached out to doesn't realize what's going on - there's no good way for them to notice.

A big part of why I consider this class of bugs to be serious is that it's non-linear. Yes, a user can always make a mistake and install two copies of a program; that would typically use twice the resources. But under our current understanding, installing two copies of cert-manager can result in 100,000-1,000,000 times as many requests as installing just one copy (based on an expected "normal" traffic of 10 requests per renewal period, or more generously 10 requests per day).

It's not clear to me how big a reorganization it would be to move to internal storage; it may be prohibitive. I'd be curious to hear more. My intuition that it's worthwhile is because, so far, a series of fixes to address specific symptoms haven't succeeded in fully addressing the problem. Usually that means that the problems need a more structural approach.

I'd be happy to set up a quick call to go over some of these details

Thanks! I'll send you an email to schedule.

@munnerz
Member

munnerz commented Oct 28, 2019

... 'helpful' github automation closed this issue - re-opening it so that we can explicitly close it when we're happy 😄

@munnerz
Member

munnerz commented Oct 28, 2019

(to clarify, the changes in #2219, and various others, should definitely help significantly in those cases where users are running multiple instances of cert-manager with leader election not properly enabled, but we should wait for some kind of statistical validation of that first!)

@kushwahashiv

kushwahashiv commented Oct 29, 2019

Hi,

I'm getting the following error on my GKE cluster. My domain is registered at godaddy.com.
My question: what is causing the error below, and is there something I need to do on godaddy.com so that the .well-known challenge can be verified?

cert-manager/controller/challenges "msg"="propagation check failed" "error"="failed to perform self check GET request 'http://www.abc.com/.well-known/acme-challenge/AZl_evY1PscNKi95EdfFNuYG_Gl75-Hi8we7Efbyy7I': Get http://www.abc/.well-known/acme-challenge/AZl_evY1PscNKi95E3EFNxcdfl75-Hi8we7Efbyy7I: dial tcp: lookup www.abc.in on 10.12.244.10:53: no such host" "dnsName"="www.abc.com" "resource_kind"="Challenge" "resource_name"="abc-3611830638-3386842356-1600275291" "resource_namespace"="default" "type"="http-01"

@munnerz
Member

munnerz commented Oct 29, 2019

@kushwahashiv this issue isn't the place for generic cert-manager support - could you join the #cert-manager channel over on https://slack.k8s.io so we can work through getting this working for you?

@kushwahashiv

@munnerz OK, I have joined the Slack channel. I deleted the whole GKE cluster; let me re-create the cluster and its deployments, and then I will connect on Slack if the issue still persists. Thanks for your prompt reply.

/ Shiv

@retest-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2020
@jsha
Contributor Author

jsha commented Jan 27, 2020

I think we can close this one now. There are currently only 2 instances of cert-manager v0.8.x in our top clients (though there are a smattering of other versions showing up). Thanks for all your work on the issue!

@jsha jsha closed this as completed Jan 27, 2020
v0.12 automation moved this from In progress to Done Jan 27, 2020