Order stuck in errored state #4441

Closed
rittneje opened this issue Sep 9, 2021 · 7 comments
Labels
triage/support Indicates an issue that is a support question.

Comments

@rittneje

rittneje commented Sep 9, 2021

Describe the bug:

We have a certificate that is supposed to be refreshed with Let's Encrypt once a week. Back in June, the Order failed mysteriously:

Failed to finalize Order: 403 urn:ietf:params:acme:error:orderNotReady: Order's status ("valid") is not acceptable for finalization

The Order then remained in the "errored" state for about 3 months, until the certificate itself finally expired. It seems that cert-manager incorrectly keeps trying to reuse the broken Order instead of starting a new one.
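
For context on the error above: RFC 8555 allows finalization only while an order is in the "ready" state, and the server answers any other finalize request with 403 orderNotReady. An order that is already "valid" has been issued, so the certificate can be downloaded instead. The following Go snippet is a minimal sketch of that status-to-action mapping. The names OrderAction and nextAction are hypothetical, and this is not cert-manager's actual code.

```go
package main

import "fmt"

// OrderAction is a hypothetical label for the next step a client takes on an ACME order.
type OrderAction string

const (
	ActionAuthorize OrderAction = "complete-authorizations" // challenges still outstanding
	ActionFinalize  OrderAction = "finalize"                // submit the CSR
	ActionWait      OrderAction = "wait"                    // the CA is issuing the certificate
	ActionDownload  OrderAction = "download-certificate"    // already issued; do not finalize again
	ActionNewOrder  OrderAction = "new-order"               // this order cannot succeed; start over
)

// nextAction maps an RFC 8555 order status to the next step.
// Only "ready" leads to finalization; finalizing in any other
// state is what produces the orderNotReady error.
func nextAction(status string) OrderAction {
	switch status {
	case "pending":
		return ActionAuthorize
	case "ready":
		return ActionFinalize
	case "processing":
		return ActionWait
	case "valid":
		return ActionDownload
	default: // "invalid" or any unrecognized status
		return ActionNewOrder
	}
}

func main() {
	for _, s := range []string{"pending", "ready", "processing", "valid", "invalid"} {
		fmt.Printf("%-10s -> %s\n", s, nextAction(s))
	}
}
```

Under this mapping, the Order in this report (status "valid") would have its certificate fetched rather than being finalized a second time.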

This particular certificate had been renewed many times previously. We did notice similar issues with other CertificateRequests/Orders, but not all at the same time. Nonetheless, the results were identical. If an Order fails due to some transient issue, cert-manager incorrectly tries to reuse that broken CertificateRequest forever rather than making a new one, and then eventually the Certificate expires.

The only way to fix it is to manually delete the offending CertificateRequest, and then wait an hour for it to try again.

Expected behaviour:

It should have fixed itself automatically without any manual intervention.

Steps to reproduce the bug:

We really don't know what specifically caused the Order to fail in the first place, but once it does, this will happen.

Anything else we need to know?:

Environment details:

  • Kubernetes version: 1.18
  • Cloud-provider/provisioner: AWS EKS
  • cert-manager version: 1.2.0
  • Install method: kubectl apply

/kind bug

@jetstack-bot jetstack-bot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 9, 2021
@irbekrm
Collaborator

irbekrm commented Sep 10, 2021

This should have been fixed in #4130, which was released in v1.5.0. Would you be able to upgrade and let us know whether that fixes it?

@rittneje
Author

@irbekrm Sure, we can check after our next cert-manager upgrade cycle.

@jakexks
Member

jakexks commented Oct 7, 2021

Please re-open if you experience the issue when cert-manager is up to date.
/remove-kind bug
/triage support
/close

@jetstack-bot jetstack-bot added triage/support Indicates an issue that is a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Oct 7, 2021
@jetstack-bot
Collaborator

@jakexks: Closing this issue.

In response to this:

Please re-open if you experience the issue when cert-manager is up to date.
/remove-kind bug
/triage support
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@PSanetra

@irbekrm I guess this is a duplicate of #2765 and does not seem to be fixed by 1.6.1

@irbekrm
Collaborator

irbekrm commented Nov 29, 2021

@irbekrm I guess this is a duplicate of #2765 and does not seem to be fixed by 1.6.1

From a brief look, #2765 is about not retrying finalization of Orders that have already been finalized.

The issue described here is also caused by retrying to finalize already-finalized Orders. However, creating a new Order after the previous one errored because of the repeated finalization attempt may be a sufficient solution here: it should at least ensure that failed certificate requests are retried (by creating a new CertificateRequest after the backoff period).
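
As a rough illustration of the "new CertificateRequest after the backoff period" idea, here is a minimal Go sketch. The function name shouldCreateNewRequest is hypothetical, and the one-hour retryBackoff is an assumption based on the one-hour wait reported earlier in this thread. It is not cert-manager's actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// retryBackoff is an assumed value, taken from the one-hour wait
// reported in this thread rather than from cert-manager's source.
const retryBackoff = time.Hour

// shouldCreateNewRequest reports whether a fresh CertificateRequest
// should replace a failed one. It returns true only when the previous
// request failed and at least retryBackoff has elapsed since the failure.
func shouldCreateNewRequest(requestFailed bool, sinceFailure time.Duration) bool {
	return requestFailed && sinceFailure >= retryBackoff
}

func main() {
	// Failed 30 minutes ago: still inside the backoff window.
	fmt.Println(shouldCreateNewRequest(true, 30*time.Minute))
	// Failed 2 hours ago: backoff elapsed, so issue a new request.
	fmt.Println(shouldCreateNewRequest(true, 2*time.Hour))
	// Not failed: nothing to retry.
	fmt.Println(shouldCreateNewRequest(false, 2*time.Hour))
}
```

The point of the sketch is that a failed request is never reused: once the backoff has passed, a new request (and therefore a new Order) is created.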

does not seem to be fixed by 1.6.1

Do you have some logs that you could add to #2765? The status of the Orders, CertificateRequests, etc. would also be useful, as would knowing what you were trying to achieve (why did you need to run kubectl cert-manager -n my-namespace renew my-cert manually?).

@PSanetra

@irbekrm Sorry, I don't have the exact logs and resource states anymore. The issue is now resolved for us.

why did you need to run kubectl cert-manager -n my-namespace renew my-cert manually?

I needed to run this command because we have set the preferredChain option on the cluster issuer so that we always get the ISRG Root X1 chain. Certificates with the old chain were still considered valid by cert-manager, so we needed to run the renew command.


5 participants