@joelanford
Member

Description of the change:

Move the listInstallPlans() call so that it runs after the muInstallPlan lock is acquired in ensureInstallPlan(). Previously, the check for existing install plans ran before the lock was acquired, so worker 2 could perform its check before worker 1 had a chance to create an install plan, even with the lock in place.

This fixes the race condition by ensuring that the sequence is:

  1. Worker 1 acquires lock
  2. Worker 1 checks for existing install plans
  3. Worker 1 creates new install plan (if needed)
  4. Worker 1 releases lock
  5. Worker 2 acquires lock
  6. Worker 2 checks for existing install plans (now sees worker 1's)
  7. Worker 2 reuses existing install plan
  8. Worker 2 releases lock
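
The sequence above can be sketched in Go roughly as follows. This is a minimal, self-contained illustration and not the actual OLM code: `planStore`, `operator`, `runWorkers`, and the signature of `ensureInstallPlan` are hypothetical stand-ins, and the in-memory map takes the place of the cluster. Only the ordering (lock first, then list, then create) is taken from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// planStore is a hypothetical in-memory stand-in for cluster state.
type planStore struct {
	mu    sync.Mutex // guards plans; unrelated to muInstallPlan
	plans map[string][]string
}

// list returns a copy of the install plans recorded for a subscription.
func (s *planStore) list(sub string) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]string(nil), s.plans[sub]...)
}

// create records a new install plan for a subscription.
func (s *planStore) create(sub, name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.plans[sub] = append(s.plans[sub], name)
}

// operator is a hypothetical type; only the muInstallPlan name comes from the PR.
type operator struct {
	muInstallPlan sync.Mutex
	store         *planStore
}

// ensureInstallPlan checks for existing plans only after taking the lock,
// so the check and the create form a single critical section.
func (o *operator) ensureInstallPlan(sub, worker string) string {
	o.muInstallPlan.Lock()
	defer o.muInstallPlan.Unlock()

	if existing := o.store.list(sub); len(existing) > 0 {
		return existing[0] // reuse the plan another worker already created
	}
	name := "install-" + worker
	o.store.create(sub, name)
	return name
}

// runWorkers runs n concurrent workers against one subscription and
// returns how many install plans exist afterwards.
func runWorkers(n int) int {
	op := &operator{store: &planStore{plans: map[string][]string{}}}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			op.ensureInstallPlan("my-subscription", fmt.Sprintf("worker-%d", id))
		}(i)
	}
	wg.Wait()
	return len(op.store.list("my-subscription"))
}

func main() {
	fmt.Println("install plans created:", runWorkers(8))
}
```

Because the list and the create happen under the same lock, whichever worker enters second always observes the first worker's plan, regardless of how many workers run.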

Motivation for the change:

Without this fix, both workers could see no existing install plans before either acquired the lock, leading to duplicate install plans being created for the same subscription.
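
For illustration, the pre-fix ordering can be reproduced deterministically with a small Go sketch. Everything here is an assumption made for the example: `racyCreateCount` is a hypothetical function, the slice stands in for the cluster, and the `listed` barrier is an artificial device that forces every worker to finish its check before any worker takes the lock. In the real code the interleaving depends on timing rather than on a barrier.

```go
package main

import (
	"fmt"
	"sync"
)

// racyCreateCount models the pre-fix ordering (check first, lock second)
// and returns how many install plans end up being created.
func racyCreateCount(workers int) int {
	var (
		muInstallPlan sync.Mutex     // serializes creation only
		storeMu       sync.Mutex     // guards plans (stand-in for cluster state)
		plans         []string       // hypothetical in-memory install plans
		listed        sync.WaitGroup // barrier: all checks finish before any create
		done          sync.WaitGroup
	)
	listed.Add(workers)
	done.Add(workers)

	for i := 0; i < workers; i++ {
		go func(id int) {
			defer done.Done()

			// The check runs BEFORE the lock is acquired.
			storeMu.Lock()
			found := len(plans) > 0
			storeMu.Unlock()

			// Wait until every worker has completed its check.
			listed.Done()
			listed.Wait()

			muInstallPlan.Lock()
			defer muInstallPlan.Unlock()
			if found {
				return // never taken here: every check saw an empty list
			}
			storeMu.Lock()
			plans = append(plans, fmt.Sprintf("install-%d", id))
			storeMu.Unlock()
		}(i)
	}
	done.Wait()
	return len(plans)
}

func main() {
	fmt.Println("install plans created:", racyCreateCount(2)) // prints "install plans created: 2"
}
```

The lock still serializes the two creates, but each worker acts on a stale result, so both create a plan for the same subscription.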

Architectural changes:

None

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@perdasilva
Collaborator

/approve

@openshift-ci

openshift-ci bot commented Oct 23, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: perdasilva

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) Oct 23, 2025
@perdasilva
Collaborator

Just saw #3682 solves this in the same way (and adds a unit test). Let's get that one down instead?

@perdasilva perdasilva removed the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) Oct 23, 2025
@joelanford
Member Author

Duplicate of #3682, but that one has a really nice unit test! Closing this one.

@joelanford joelanford closed this Oct 23, 2025