MIR-970: Unblock addon deprovisioning when provisioning saga is in-flight #758
The addon reconciler ran with a single worker, so a long-running provisioning saga (e.g. valkey waiting on sandbox pool readiness) blocked every other addon event — including user-triggered deprovisioning. Bumps the worker count to 4 so independent associations reconcile in parallel (same-entity serialization is already guaranteed by the framework's in-flight map). Also adds a pre-flight re-read in provision() so a stale "pending" event (e.g. from startup resync) does not re-trigger provisioning on an association the user has already destroyed. Fixes MIR-970.
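The effect of the worker bump can be illustrated with a minimal sketch. Everything here is hypothetical and stands in for the real framework: the `controller` type, its `inFlight` map, the `event` struct, and `completionOrder` are illustrative names, and the sketch simply skips an event whose ID is already in flight (how the real framework handles that case is not shown in this PR). A slow "valkey" event is queued ahead of a fast "rabbitmq" event.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// event is a hypothetical reconcile event keyed by association ID.
type event struct {
	ID   string
	Work func()
}

// controller is a stand-in for the framework's reconcile controller.
type controller struct {
	mu       sync.Mutex
	inFlight map[string]bool // IDs currently being reconciled
	queue    chan event
}

func newController() *controller {
	return &controller{inFlight: map[string]bool{}, queue: make(chan event, 16)}
}

// run starts n workers and returns once the queue is closed and drained.
func (c *controller) run(workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range c.queue {
				c.handle(ev)
			}
		}()
	}
	wg.Wait()
}

// handle reconciles one event unless the same ID is already in flight.
func (c *controller) handle(ev event) {
	c.mu.Lock()
	if c.inFlight[ev.ID] {
		c.mu.Unlock()
		return // assumption: this sketch skips; a real framework may requeue
	}
	c.inFlight[ev.ID] = true
	c.mu.Unlock()
	defer func() {
		c.mu.Lock()
		delete(c.inFlight, ev.ID)
		c.mu.Unlock()
	}()
	ev.Work()
}

// completionOrder queues a slow event ahead of a fast one and reports
// the order in which they finish for the given worker count.
func completionOrder(workers int) []string {
	c := newController()
	var mu sync.Mutex
	var order []string
	done := func(name string) {
		mu.Lock()
		order = append(order, name)
		mu.Unlock()
	}
	c.queue <- event{ID: "valkey", Work: func() {
		time.Sleep(100 * time.Millisecond) // simulated wait on sandbox pool
		done("valkey")
	}}
	c.queue <- event{ID: "rabbitmq", Work: func() { done("rabbitmq") }}
	close(c.queue)
	c.run(workers)
	return order
}

func main() {
	fmt.Println("1 worker: ", completionOrder(1))
	fmt.Println("4 workers:", completionOrder(4))
}
```

With one worker the fast event has to wait behind the slow one; with four it finishes first, which is the behavior the PR is after for deprovisioning events.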
Walkthrough
This pull request increases the concurrency of the addon association reconciliation controller from one to four workers. Additionally, the provisioning logic now includes a safeguard that re-fetches the current …
// Multiple workers so a long-running provisioning saga for one
// association does not block reconciliation of others. Same-entity
// concurrency is already prevented by ReconcileController.inFlight.
4,
Any reason to go to 4 rather than 2 here? Two workers already solve the stated problem (one slow saga can't starve the queue), and each provisioning saga does a fair bit of entity store work (config versions, app versions, active version patches). Starting conservative and bumping later if someone actually hits the limit seems like the lower-risk move. Don't feel strongly tho, totally shippable as is!
We've already done 3 elsewhere, and since these are slower due to startup, 4 felt like a good fit for now.
Summary
Bump the addon reconciler to 4 workers so a long-running provisioning saga for one association does not block reconciliation of others. Same-entity concurrency is already prevented by ReconcileController.inFlight.
Pre-flight re-read in provision(): if the association is no longer pending (e.g. user ran addon destroy during the crash/resync window), skip instead of re-starting a saga that would conflict with leftover resources.

Why
Observed sequence:
User runs addon destroy; status flips to deprovisioning.
On restart, the resync re-enqueues stale pending events and re-triggers provisioning on associations the user had already destroyed, creating conflicting resources.

Test plan
go test ./controllers/addon/ (new TestProvisionSkipsWhenAssociationNoLongerPending covers the stale-event path)
go vet ./controllers/addon/... ./components/coordinate/...
make dev: install valkey + rabbitmq, confirm rabbitmq finishes while valkey is still waiting on its sandbox pool.

Out of scope
Coordinating saga Recover() with a concurrent reconciler-driven re-provision is tracked as a follow-up. The fixes here narrow the window significantly but do not fully close it.
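
For illustration, the pre-flight re-read in provision() might look roughly like the sketch below. This is not the PR's actual code: the `Store` interface, `Association` struct, `Status` constants, `memStore`, and the `startSaga` callback are all hypothetical names, and treating a missing association as "skip" is an assumption. Because it is a check-then-act guard, it narrows the race window rather than closing it, which matches the follow-up noted above.

```go
package main

import (
	"errors"
	"fmt"
)

// Status and Association are hypothetical stand-ins for the real entity types.
type Status string

const (
	StatusPending        Status = "pending"
	StatusDeprovisioning Status = "deprovisioning"
)

type Association struct {
	ID     string
	Status Status
}

// ErrNotFound is a hypothetical sentinel for a deleted association.
var ErrNotFound = errors.New("association not found")

// Store is a hypothetical read interface over the entity store.
type Store interface {
	Get(id string) (Association, error)
}

// memStore is an in-memory Store used only for this sketch.
type memStore map[string]Association

func (m memStore) Get(id string) (Association, error) {
	a, ok := m[id]
	if !ok {
		return Association{}, ErrNotFound
	}
	return a, nil
}

// provision re-reads the association before starting a saga. The event
// may carry a stale snapshot (e.g. from a startup resync), so only the
// freshly read status is trusted. It reports whether a saga was started.
func provision(store Store, ev Association, startSaga func(Association) error) (bool, error) {
	current, err := store.Get(ev.ID)
	if errors.Is(err, ErrNotFound) {
		return false, nil // assumption: a deleted association is skipped
	}
	if err != nil {
		return false, err
	}
	if current.Status != StatusPending {
		return false, nil // user moved it on (e.g. deprovisioning); skip
	}
	if err := startSaga(current); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	store := memStore{
		"a1": {ID: "a1", Status: StatusDeprovisioning},
		"a2": {ID: "a2", Status: StatusPending},
	}
	start := func(a Association) error {
		fmt.Println("starting saga for", a.ID)
		return nil
	}
	// Stale event: still says pending, but the store says deprovisioning.
	started, err := provision(store, Association{ID: "a1", Status: StatusPending}, start)
	fmt.Println("a1 started:", started, "err:", err)

	started, err = provision(store, Association{ID: "a2", Status: StatusPending}, start)
	fmt.Println("a2 started:", started, "err:", err)
}
```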