2686 fixes stuck Orders issue #3805
Conversation
Force-pushed fda3f6e to 6769ca6
/retest
A lot of setup failures lately, it seems.
Force-pushed 6769ca6 to bd31d3f
/milestone v1.4
Force-pushed bd31d3f to e79d6c1
Force-pushed e79d6c1 to 3812c61
Thanks @irbekrm
Really nice refactoring to slot in the scheduled work queue.
And thanks for adding that integration test.
I left one or two comments and suggestions below which you can answer or address as you like.
Thank you for the code review and good suggestions @wallrj ! I think I have addressed them all and created new issues for some; please take a look.
Looks good to me @irbekrm
Just a typo and a suggestion which you can ignore and /unhold
if you prefer.
/lgtm
/hold
Force-pushed ca1c348 to bff1aab
Thanks for the review @wallrj !
👍
/lgtm
/hold cancel
To avoid stuck Orders in case of a misbehaving ACME server Signed-off-by: irbekrm <irbekrm@gmail.com>
To allow for testing whether an item gets re-queued in unit tests Signed-off-by: irbekrm <irbekrm@gmail.com>
Signed-off-by: irbekrm <irbekrm@gmail.com>
So that it is easier to use with the existing test framework and also more similar to how most other controllers are created. Signed-off-by: irbekrm <irbekrm@gmail.com>
Signed-off-by: irbekrm <irbekrm@gmail.com>
Signed-off-by: irbekrm <irbekrm@gmail.com>
Force-pushed bff1aab to 06f6b46
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: irbekrm, wallrj. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
```go
@@ -144,10 +152,25 @@ func (c *controller) Sync(ctx context.Context, o *cmacme.Order) (err error) {
		return err
	}

	acmeOrder, err := getACMEOrder(ctx, cl, o)
```
Just stumbled across this - so one thing to note is that this controller (and all other ACME controllers) have always made a point of avoiding calls to the ACME API unless strictly necessary. This is because, in instances where unexpected errors crop up (or a bug in our code), it can lead to large numbers of requests being sent to Let's Encrypt from many different cert-manager installations. That is also why you'll notice very few calls using the ACME client throughout these controllers.
If there is no other way to achieve this without calling out to the ACME API, then that's fine - but this needs a lot of care to ensure there are no cases where we can enter into loops. It's also worth viewing it as a last resort and an expensive call to make, so gating this kind of call behind as many local checks as possible would make sense (e.g. given that the 'happy path' today works without making this extra call, is there an error case that we could add this call into rather than having it on the happy path?).
By adding it on the happy path of all syncs of orders, we have increased the number of API calls made in this controller for everyone by one.
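Gating an expensive remote call behind cheap local checks, as suggested above, can be sketched as follows. This is a minimal illustration with hypothetical names (`localOrderState`, `shouldQueryACMEServer`, `syncOrder`), not cert-manager's actual code:

```go
package main

import "fmt"

// Hypothetical local view of an Order's cached state; field names are
// illustrative only.
type localOrderState struct {
	State           string // last observed ACME order state, e.g. "pending"
	ChallengesValid bool   // all Challenge resources reported 'valid'
}

// shouldQueryACMEServer gates the expensive remote call behind local
// checks: only reach out when the cached state is 'pending' while all
// challenges look valid - the one case where the local cache may be stale.
func shouldQueryACMEServer(s localOrderState) bool {
	if s.State != "pending" {
		return false // ready/valid/terminal states need no remote confirmation
	}
	return s.ChallengesValid
}

// syncOrder only calls the (injected) remote fetch as a last resort.
func syncOrder(s localOrderState, fetchRemote func() (string, error)) (string, error) {
	if !shouldQueryACMEServer(s) {
		return s.State, nil // happy path: no ACME API call at all
	}
	state, err := fetchRemote()
	if err != nil {
		return "", fmt.Errorf("querying ACME server: %w", err)
	}
	return state, nil
}

func main() {
	calls := 0
	fetch := func() (string, error) { calls++; return "ready", nil }

	// Ready locally: no remote call is made.
	st, _ := syncOrder(localOrderState{State: "ready", ChallengesValid: true}, fetch)
	fmt.Println(st, calls) // ready 0

	// Pending with valid challenges: exactly one remote call.
	st, _ = syncOrder(localOrderState{State: "pending", ChallengesValid: true}, fetch)
	fmt.Println(st, calls) // ready 1
}
```

The point of the shape is that the ACME client call sits behind a pure local predicate, so most syncs never touch the network.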
What this PR does / why we need it:
See #2868 for context.
Once an ACME server has validated authorizations for an order, it should set the status of that order to 'ready' (see spec). The order can then be finalized, after which its status becomes 'valid'. However, it is possible that a misbehaving ACME server would, at a point in time, have validated the authorizations but not yet have set the status of the order to 'ready'. Currently in this scenario cert-manager's orders controller would not keep polling the ACME server for changes in the order's status; instead the `Order` CR would be stuck in a pending state.

Happy path:
- `Challenge` resources for the `Order` have been set to 'valid'. The ACME server has also set the state of the ACME order to 'ready'.
- The orders controller syncs the `Order`, observes the valid `Challenge`s, retrieves the status of the ACME order and updates the status of the `Order` CR accordingly (pending -> ready).
- The controller syncs the `Order` again (since its status was just changed), observes that the status is now 'ready' and finalizes the order.

Sad path:
- `Challenge` resources for the `Order` have been set to 'valid'. The ACME server has not (yet) set the state of the ACME order to 'ready' <- this goes against the ACME spec.
- The orders controller syncs the `Order`, observes the valid `Challenge`s and retrieves the status of the ACME order, which is still 'pending'. It tries to update the state of the `Order` CR (pending -> pending); this update is probably not applied as there is no actual status change, and no further re-syncs would be triggered - the `Order` remains pending.

Which issue this PR fixes:
Fixes #2868
Special notes for your reviewer:
I fixed this by re-queueing the `Order` if all the `Challenge`s are valid but the state of the ACME order is still pending, in the same way as we do to, for example, re-queue a `Certificate` that needs to be renewed at a specific time.

We could have also:
A. Polled the ACME server continuously instead of re-queueing - but we don't know for how long the ACME order might remain in this state, so would we allow the worker to do this potentially 'forever', or introduce a random timeout?
B. Done something with resync intervals - however, this seems to be a rarely encountered edge case, so maybe we don't want to modify the orders controller's resync period just for this.

I have tested this only via the integration test for the orders controller that is added with this PR.
Release note:
/kind bug