Skip to content

MIR-970: Unblock addon deprovisioning when provisioning saga is in-flight#758

Merged
evanphx merged 1 commit into
mainfrom
mir-970-addon-deprovisioning-blocked-when-provisioning-sag
Apr 13, 2026
Merged

MIR-970: Unblock addon deprovisioning when provisioning saga is in-flight#758
evanphx merged 1 commit into
mainfrom
mir-970-addon-deprovisioning-blocked-when-provisioning-sag

Conversation

@evanphx
Copy link
Copy Markdown
Contributor

@evanphx evanphx commented Apr 13, 2026

Summary

  • Bump addon reconciler worker count from 1 → 4 so a long-running provisioning saga (e.g. valkey waiting on sandbox pool readiness) no longer blocks every other addon event. Same-entity serialization is still guaranteed by ReconcileController.inFlight.
  • Add a pre-flight re-read at the top of provision(): if the association is no longer pending (e.g. user ran addon destroy during the crash/resync window) skip instead of re-starting a saga that would conflict with leftover resources.

Why

Observed sequence:

  1. Reconciler starts provisioning rabbitmq and valkey.
  2. Valkey saga blocks waiting on sandbox pool readiness — occupies the sole worker.
  3. User runs addon destroy; status flips to deprovisioning.
  4. Deprovisioning event sits in the work queue forever.

On restart, the resync re-enqueues stale pending events and re-triggers provisioning on associations the user had already destroyed, creating conflicting resources.

Test plan

  • go test ./controllers/addon/ — new TestProvisionSkipsWhenAssociationNoLongerPending covers the stale-event path.
  • go vet ./controllers/addon/... ./components/coordinate/...
  • Smoke in make dev: install valkey + rabbitmq, confirm rabbitmq finishes while valkey is still waiting on its sandbox pool.

Out of scope

Coordinating saga Recover() with a concurrent reconciler-driven re-provision is tracked as a follow-up — the fixes here narrow the window significantly but do not fully close it.

The addon reconciler ran with a single worker, so a long-running
provisioning saga (e.g. valkey waiting on sandbox pool readiness)
blocked every other addon event — including user-triggered
deprovisioning. Bumps the worker count to 4 so independent
associations reconcile in parallel (same-entity serialization is
already guaranteed by the framework's in-flight map).

Also adds a pre-flight re-read in provision() so a stale "pending"
event (e.g. from startup resync) does not re-trigger provisioning
on an association the user has already destroyed.

Fixes MIR-970.
@evanphx evanphx requested a review from a team as a code owner April 13, 2026 21:46
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4eaa390b-a3c4-48f2-a90d-22afca0c7dd7

📥 Commits

Reviewing files that changed from the base of the PR and between d00e1a2 and 86c960a.

📒 Files selected for processing (3)
  • components/coordinate/coordinate.go
  • controllers/addon/controller.go
  • controllers/addon/controller_test.go

📝 Walkthrough

Walkthrough

This pull request increases the concurrency of the addon association reconciliation controller from one to four workers. Additionally, the provisioning logic now includes a safeguard that re-fetches the current AddonAssociation by ID before transitioning its state. If the re-fetched association's status is no longer "pending", provisioning is skipped. If the re-fetch fails, an error is returned. A new test case validates that provisioning correctly skips when an association's status changes between initial retrieval and the provisioning step.

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Comment @coderabbitai help to get the list of available commands and usage tips.

// Multiple workers so a long-running provisioning saga for one
// association does not block reconciliation of others. Same-entity
// concurrency is already prevented by ReconcileController.inFlight.
4,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to go to 4 rather than 2 here? Two workers already solve the stated problem (one slow saga can't starve the queue), and each provisioning saga does a fair bit of entity store work (config versions, app versions, active version patches). Starting conservative and bumping later if someone actually hits the limit seems like the lower-risk move. Don't feel strongly tho, totally shippable as is!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already done 3 elsewhere and if these are slower, due to startup, felt like it was a good fit to reach for 4 for now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM!

@evanphx evanphx merged commit 8154323 into main Apr 13, 2026
22 of 23 checks passed
@evanphx evanphx deleted the mir-970-addon-deprovisioning-blocked-when-provisioning-sag branch April 13, 2026 23:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants