2686 fixes stuck Orders issue #3805
Conversation
Force-pushed fda3f6e to 6769ca6
/retest
A lot of setup failures lately, it seems.
Force-pushed 6769ca6 to bd31d3f
/milestone v1.4
Force-pushed bd31d3f to e79d6c1
Force-pushed e79d6c1 to 3812c61
Thanks @irbekrm
Really nice refactoring to slot in the scheduled work queue.
And thanks for adding that integration test.
I left one or two comments and suggestions below which you can answer or address as you like.
Thank you for the code review and good suggestions @wallrj ! I think I have addressed them all and created new issues for some; please take a look.
Looks good to me @irbekrm
Just a typo and a suggestion which you can ignore and /unhold
if you prefer.
/lgtm
/hold
Force-pushed ca1c348 to bff1aab
Thanks for the review @wallrj !
👍
/lgtm
/hold cancel
To avoid stuck Orders in case of a misbehaving ACME server Signed-off-by: irbekrm <irbekrm@gmail.com>
To allow for testing whether an item gets re-queued in unit tests Signed-off-by: irbekrm <irbekrm@gmail.com>
Signed-off-by: irbekrm <irbekrm@gmail.com>
So that it is easier to use with the existing test framework and also more similar to how most other controllers are created. Signed-off-by: irbekrm <irbekrm@gmail.com>
Signed-off-by: irbekrm <irbekrm@gmail.com>
Signed-off-by: irbekrm <irbekrm@gmail.com>
Force-pushed bff1aab to 06f6b46
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: irbekrm, wallrj. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
```go
@@ -144,10 +152,25 @@ func (c *controller) Sync(ctx context.Context, o *cmacme.Order) (err error) {
		return err
	}

	acmeOrder, err := getACMEOrder(ctx, cl, o)
```
Just stumbled across this - so one thing to note is that this controller (and all other ACME controllers) have always made a point of avoiding calls to the ACME API unless strictly necessary. This is because, in instances where unexpected errors crop up (or a bug in our code), it can lead to large numbers of requests being sent to Let's Encrypt from many different cert-manager installations. That is also why you'll notice very few calls using the ACME client throughout these controllers.
If there is no other way to achieve this without calling out to the ACME API, then that's fine - but this needs a lot of care to ensure there are no cases where we can enter into loops. It's also worth viewing it as a last resort and an expensive call to make, so gating this kind of call behind as many local checks as possible would make sense (e.g. given that the 'happy path' today works without making this extra call, is there an error case that we could add this call into rather than having it on the happy path?).
By adding it on the happy path of all syncs of orders, we have increased the number of API calls made in this controller for everyone by one.
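Gating an expensive remote call behind cheap local checks, as suggested above, can be sketched as follows. This is a minimal illustration with hypothetical names (`localOrderState`, `shouldQueryACMEServer`, `syncOrder`), not cert-manager's actual code:

```go
package main

import "fmt"

// Hypothetical local view of an Order's cached state; field names are
// illustrative only.
type localOrderState struct {
	State           string // last observed ACME order state, e.g. "pending"
	ChallengesValid bool   // all Challenge resources reported 'valid'
}

// shouldQueryACMEServer gates the expensive remote call behind local
// checks: only reach out when the cached state is 'pending' while all
// challenges look valid - the one case where the local cache may be stale.
func shouldQueryACMEServer(s localOrderState) bool {
	if s.State != "pending" {
		return false // ready/valid/terminal states need no remote confirmation
	}
	return s.ChallengesValid
}

// syncOrder only calls the (injected) remote fetch as a last resort.
func syncOrder(s localOrderState, fetchRemote func() (string, error)) (string, error) {
	if !shouldQueryACMEServer(s) {
		return s.State, nil // happy path: no ACME API call at all
	}
	state, err := fetchRemote()
	if err != nil {
		return "", fmt.Errorf("querying ACME server: %w", err)
	}
	return state, nil
}

func main() {
	calls := 0
	fetch := func() (string, error) { calls++; return "ready", nil }

	// Ready locally: no remote call is made.
	st, _ := syncOrder(localOrderState{State: "ready", ChallengesValid: true}, fetch)
	fmt.Println(st, calls) // ready 0

	// Pending with valid challenges: exactly one remote call.
	st, _ = syncOrder(localOrderState{State: "pending", ChallengesValid: true}, fetch)
	fmt.Println(st, calls) // ready 1
}
```

The point of the shape is that the ACME client call sits behind a pure local predicate, so most syncs never touch the network.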
What this PR does / why we need it:
See #2868 for context.
Once an ACME server has validated authorizations for an order, it should set the status of that order to 'ready' (see spec). The order can then be finalized, after which its status becomes 'valid'. However, it is possible that a misbehaving ACME server would, at a point in time, have validated the authorizations but not yet have set the status of the order to 'ready'. Currently in this scenario cert-manager's orders controller would not keep polling the ACME server for changes in the order's status; instead the `Order` CR would be stuck in a pending state.

Happy path:
- `Challenge` resources for the `Order` have been set to 'valid'. The ACME server has also set the state of the ACME order to 'ready'.
- The orders controller syncs the `Order`, observes the valid `Challenge`s, retrieves the status of the ACME order and updates the status of the `Order` CR accordingly (pending -> ready).
- The controller syncs the `Order` again (since its status was just changed), observes that the status is now 'ready' and finalizes the order.

Sad path:
- `Challenge` resources for the `Order` have been set to 'valid'. The ACME server has not (yet) set the state of the ACME order to 'ready' <- this goes against the ACME spec.
- The orders controller syncs the `Order`, observes the valid `Challenge`s and retrieves the status of the ACME order, which is still 'pending'. It tries to update the state of the `Order` CR (pending -> pending); this update is probably not applied as there is no actual status change, and no further re-syncs would be triggered - the `Order` remains pending.

Which issue this PR fixes:
Fixes #2868
Special notes for your reviewer:
I fixed this by re-queueing the `Order` if all the `Challenge`s are valid but the state of the ACME order is still pending, in the same way as we do to, for example, re-queue a `Certificate` that needs to be renewed at a specific time.

We could have also:
A. Polled the ACME server continuously instead of re-queueing - but we don't know for how long the ACME order might remain in this state, so would we allow the worker to do this potentially 'forever', or introduce a random timeout?
B. Done something with resync intervals - however, this seems to be a rarely encountered edge case, so maybe we don't want to modify the orders controller's resync period just for this.

I have tested this only via the integration test for the orders controller that is added with this PR.
Release note:
/kind bug