Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quota check preceding resource check #614

Merged
merged 28 commits into from
Aug 30, 2023

Conversation

metalcycling
Copy link
Collaborator

@metalcycling metalcycling commented Aug 28, 2023

Issue link

#467

What changes have been made

MCAD with quota management allows AppWrappers to borrow quota from other nodes to maximize cluster utilization. In such cases, the borrowing AppWrappers can be preempted by new AppWrappers that request quota within their allocation (because the borrowing AppWrappers should not have full control of quota that is not theirs). This means that when counting the resources available in the cluster to consider if an AppWrapper can be dispatched, the utilized resources of AppWrappers that could be preempted due to borrowing should be considered available for the dispatch check. This PR swaps the resource check and the quota check so that quota is checked first and it updates the resources available in case some AppWrappers can be preempted. If the quota passes but the resources are not enough (even after preemption), the state of the quota tree is recovered and the AppWrapper is put back in the queue for later retry.

Verification steps

Manual testing was performed with 2 AppWrappers: The first one fills up the cluster even though its quota is just a portion of the cluster. The second one requests quota withing its allocation, and despite the fact that the cluster is full, it is dispatched and the first AppWrapper is preempted.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

…dispatch AppWrappers that would have enough quota if borrowing AppWrappers are discounted from the resources (due to those being preempted).
@metalcycling
Copy link
Collaborator Author

I recommend looking at the changes in queuejob_controller_ex.go as side by side files. The changes are too different which causes diff to look weird.

Copy link
Member

@asm582 asm582 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR, please address the review

@metalcycling metalcycling changed the title Quota management Quota check preceding resource check Aug 28, 2023
Copy link
Member

@asm582 asm582 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We reviewed offline. the integration looks good but has a couple of TODOs that should be revisited.

@openshift-ci openshift-ci bot removed the lgtm label Aug 29, 2023
@asm582
Copy link
Member

asm582 commented Aug 30, 2023

/approve

@openshift-ci
Copy link

openshift-ci bot commented Aug 30, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asm582

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@asm582
Copy link
Member

asm582 commented Aug 30, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Aug 30, 2023
@metalcycling metalcycling merged commit 9df2f58 into project-codeflare:main Aug 30, 2023
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants