
Challenge Records Not Always Cleaned Up #3640

Open
Evesy opened this issue Feb 8, 2021 · 28 comments
Labels
  • area/acme: Indicates a PR directly modifies the ACME Issuer code
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Milestone
1.16

Comments

Evesy (Contributor) commented Feb 8, 2021

Describe the bug:
After successfully completing dns-01 challenges, cert-manager does not always clean up the TXT records it created.

Expected behaviour:
All DNS records related to challenges should be deleted once completed.

Steps to reproduce the bug:
TBC.

I currently cannot reproduce the issue consistently.

Anything else we need to know?:

Environment details:

  • Kubernetes version: v1.17.14-gke.1600
  • Cloud-provider/provisioner: GKE
  • cert-manager version: 1.1.0
  • Install method: Custom helm chart

The issue only seems to affect challenge records provisioned in Google Cloud DNS; we don't see the same behaviour for Cloudflare DNS (though about 95% of our challenges go via Cloud DNS).

I can see in the GCP logging for one example the API requests to create the record, but no requests to later delete the record.

/kind bug
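For context on what these leftover entries look like: per RFC 8555 §8.4, a dns-01 challenge places a TXT record at _acme-challenge.<domain> whose value is the base64url-encoded SHA-256 digest of the key authorization. A minimal sketch (the sample key-authorization string below is made up):

```python
import base64
import hashlib

def dns01_txt_value(key_authorization: str) -> str:
    """Return the TXT record value for a dns-01 challenge.

    Per RFC 8555 section 8.4, the record placed at
    _acme-challenge.<domain> holds the base64url-encoded SHA-256
    digest of the key authorization, with '=' padding stripped.
    """
    digest = hashlib.sha256(key_authorization.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

# Hypothetical key authorization: "<token>.<account key thumbprint>"
print(dns01_txt_value("example-token.example-thumbprint"))
```

Any 43-character base64url value on an _acme-challenge name is therefore a likely leftover challenge record.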

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 8, 2021
maelvls (Member) commented Feb 9, 2021

Hi! It looks like a bug during the "finalizer" stage; when the issue happens, would you be able to share the cert-manager-controller logs?

/triage needs-information

@jetstack-bot jetstack-bot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Feb 9, 2021
Evesy (Contributor, Author) commented Feb 9, 2021

> Hi! It looks like a bug during the "finalizer" stage; when the issue happens, would you be able to share the cert-manager-controller logs?
>
> /triage needs-information

Absolutely. I've increased the logging around cert-manager and will grab a copy of the logs the next time it happens

Evesy (Contributor, Author) commented Feb 11, 2021

Hey @maelvls -- Logs are here. Best I could do was a csv as they were exported from Kibana, cheers

maelvls (Member) commented Feb 23, 2021

Almost 30 minutes into investigating the logs, I realized I was looking at the entries in reverse-chronological order 😅

I was then surprised by the absence of any line mentioning the finalizer (something like controller/challenges/finalizer). The removal of the TXT records happens in acmechallenges/sync.go, so perhaps the Challenge object never actually gets deleted?

The challenge itself does seem to be properly deleted (i.e., metadata.deletionTimestamp becomes non-null):

sync.go:101] controller/orders msg="Order has already been completed, cleaning up any owned Challenge resources" resource_kind="Order" resource_name="sauron-adverts-evo-app-tls-78s5d-3403441770" "resource_namespace"="sauron-adverts-evo-app" "resource_version"="v1"
round_trippers.go:443] DELETE https://10.192.0.1:443/apis/acme.cert-manager.io/v1/namespaces/sauron-adverts-evo-app/challenges/sauron-adverts-evo-app-tls-78s5d-3403441770-1727866623 200 OK in 4 milliseconds

Not sure why the finalizer logs don't show :(
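For readers following along: Kubernetes only removes an object once its finalizer list is empty, so the cleanup in acmechallenges/sync.go is expected to run while the finalizer is still attached; if the finalizer was never attached (or was removed early), the Challenge disappears and the TXT record survives. A rough, hypothetical simulation of those semantics (illustrative Python, not cert-manager code; the finalizer name is an assumption):

```python
class FakeChallenge:
    """Minimal stand-in for a Challenge object with finalizer semantics."""
    def __init__(self, finalizers):
        self.finalizers = list(finalizers)
        self.deletion_timestamp = None
        self.exists = True

def delete(obj, run_cleanup):
    """Mimic the API server + controller interaction on delete."""
    obj.deletion_timestamp = "2021-02-08T00:00:00Z"  # deletion requested
    if "finalizer.acme.cert-manager.io" in obj.finalizers:
        run_cleanup()  # controller cleans up the DNS record first
        obj.finalizers.remove("finalizer.acme.cert-manager.io")
    if not obj.finalizers:
        obj.exists = False  # object actually removed

# With the finalizer present, cleanup runs before removal:
dns_records = {"_acme-challenge.example.com"}
ch = FakeChallenge(["finalizer.acme.cert-manager.io"])
delete(ch, dns_records.clear)
print(ch.exists, dns_records)  # False set()

# Without it, the object is removed but the record leaks:
dns_records = {"_acme-challenge.example.com"}
ch = FakeChallenge([])
delete(ch, dns_records.clear)
print(ch.exists, dns_records)  # record leaked
```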

Evesy (Contributor, Author) commented Aug 9, 2021

Hey, is there any more information you need on this? We're still seeing quite a lot of challenge records left behind after certificate issuance.

Happy to collect anything that'd be useful to help debug.

jetstack-bot (Contributor) commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 7, 2021
Evesy (Contributor, Author) commented Nov 8, 2021

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 8, 2021

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 6, 2022
Evesy (Contributor, Author) commented Feb 8, 2022

/remove-lifecycle stale

This is still occurring as of 1.6

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 8, 2022
@wallrj wallrj added the area/acme Indicates a PR directly modifies the ACME Issuer code label Apr 28, 2022
wallrj (Member) commented May 10, 2022

I've been looking at the code and noticed a few problems and potential cleanups:


@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 12, 2022
Evesy (Contributor, Author) commented Aug 12, 2022

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 12, 2022

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 10, 2022
Evesy (Contributor, Author) commented Nov 11, 2022

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 11, 2022
Evesy (Contributor, Author) commented Jan 3, 2023

@wallrj Hi, are there any plans to pick up the open PR and progress towards a fix for challenge records not always being cleaned up?

@irbekrm irbekrm removed the triage/needs-information Indicates an issue needs more information in order to work on it. label Feb 14, 2023
@irbekrm irbekrm added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Feb 14, 2023

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 15, 2023
mecseid commented May 15, 2023

/remove-lifecycle stale

I ran into the same issue with DigitalOcean's DNS service, which now contains a lot of TXT records from DNS challenges.

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 15, 2023
maaft commented Jul 21, 2023

Same issue here, also with DigitalOcean (didn't try other DNS services)! It's a bit annoying.


@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 19, 2023
Evesy (Contributor, Author) commented Oct 20, 2023

/remove-lifecycle stale

@jetstack-bot jetstack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 20, 2023
@wallrj wallrj self-assigned this Nov 1, 2023
mycarrysun commented:

Are there any updates on this? We're experiencing the same behavior in 1.13.3 with the azureDNS solver, but only with delegated domains. Regular subdomains in the same DNS zone are cleaned up as normal.

D3CK3R commented Jan 19, 2024

Any update here?

smeng9 commented Jan 24, 2024

The DigitalOcean TXT records keep piling up.

After several renewals, the TXT record set grows so large that the DNS response exceeds the maximum response size, and Let's Encrypt refuses to parse it: https://community.letsencrypt.org/t/max-response-size-for-dns-01/122700/6

Is there a solution to the TXT record clean-up issue?
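A rough back-of-the-envelope on why the pile-up eventually breaks validation (the 4096-byte figure is the common EDNS0 ceiling; the per-record overhead is an assumption, not exact wire math):

```python
def approx_txt_response_size(num_records, value_len=43, overhead=16):
    """Very rough size of a DNS answer carrying num_records TXT records.

    value_len: dns-01 values are 43-byte base64url SHA-256 digests.
    overhead: approximate per-RR cost (compressed name, type, class,
    TTL, RDLENGTH, TXT length byte) -- an estimate, not exact wire format.
    """
    return num_records * (value_len + overhead)

EDNS0_LIMIT = 4096  # commonly advertised maximum UDP response size

# Roughly how many leftover records fit before responses pass the limit?
max_records = EDNS0_LIMIT // (43 + 16)
print(max_records)  # 69
```

So on the order of a few dozen uncleaned records per name is enough to start hitting the limit described in the linked thread.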

D3CK3R commented Jan 24, 2024

Any simple workaround for this? We have hundreds of records in our DNS zone.
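No official workaround appears in this thread, but one manual option is to list the zone's TXT records and delete leftover _acme-challenge entries by hand (after confirming no issuance is in flight). A hypothetical helper for spotting candidates; the (name, type) tuple shape is an assumption, and actual deletion would go through your DNS provider's API or console:

```python
def stale_challenge_records(records):
    """Return record names that look like leftover dns-01 challenge entries.

    records: iterable of (name, record_type) pairs, e.g. from a zone export.
    """
    return sorted(
        name
        for name, rtype in records
        if rtype == "TXT" and name.startswith("_acme-challenge.")
    )

# Hypothetical zone export:
zone = [
    ("_acme-challenge.app.example.com.", "TXT"),
    ("app.example.com.", "A"),
    ("_dmarc.example.com.", "TXT"),
]
print(stale_challenge_records(zone))  # ['_acme-challenge.app.example.com.']
```

Note that an _acme-challenge record may be legitimately in use mid-issuance, so check challenge status before deleting anything.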

@wallrj wallrj added this to the 1.15 milestone Feb 3, 2024
@wallrj wallrj removed their assignment Feb 3, 2024

@cert-manager-prow cert-manager-prow bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 3, 2024
mycarrysun commented:

/remove-lifecycle stale

@cert-manager-prow cert-manager-prow bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 3, 2024
@inteon inteon modified the milestones: 1.15, 1.16 May 14, 2024
Routhinator commented:

This is a problem: the leftover records lead to rate-limiting issues, as DNS automations such as cert-manager and external-dns have to perform more and more paginated queries to list all records.

pre commented Aug 21, 2024
