Add autoscale check for GPU #573

Merged
openshift-merge-robot merged 15 commits into opendatahub-io:main on Oct 19, 2022

Conversation

maroroman
Contributor

@maroroman maroroman commented Sep 19, 2022

Resolves #407

Description

This PR adds logic to check the MachineAutoscaler CRs for the maximum amount of scalable GPU nodes.

How Has This Been Tested?

Testing was done on an OSD cluster with a GPU node that has autoscaling enabled.

  1. Use the following dashboard image: quay.io/mroman_redhat/odh-dashboard:dev-gpu-3
  2. The dashboard detects all MachineAutoscaler CRs and loops through their MachineSets. Each MachineSet carries the machine.openshift.io/GPU value, which should hold the true GPU count (correct me if my assumption is wrong).

The resulting backend response was this:
[image]

The frontend uses this value to extend the dropdown when the number of scalable GPUs exceeds the number of GPUs reported by Prometheus metrics.
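
The lookup described in the testing steps could be sketched roughly as below. This is an illustration under stated assumptions, not the code from this PR: the function name `getScalableGPUs` and the trimmed-down types are hypothetical, each MachineAutoscaler is assumed to reference its MachineSet through `spec.scaleTargetRef.name`, and a MachineSet is treated as scalable only while its available replicas are below `spec.maxReplicas`.

```typescript
// Trimmed-down, hypothetical shapes; the real CRs carry many more fields.
type MachineAutoscaler = {
  spec: { maxReplicas: number; scaleTargetRef: { name: string } };
};

type MachineSet = {
  metadata: { name: string; annotations?: Record<string, string> };
  status?: { availableReplicas?: number };
};

const GPU_ANNOTATION = 'machine.openshift.io/GPU';

// Returns the per-node GPU count of every autoscaled MachineSet that can
// still add at least one more node (assumption: available < maxReplicas).
function getScalableGPUs(
  autoscalers: MachineAutoscaler[],
  machineSets: MachineSet[],
): number[] {
  // Index MachineSets by name so each autoscaler lookup is constant time.
  const byName: Record<string, MachineSet> = {};
  for (const machineSet of machineSets) {
    byName[machineSet.metadata.name] = machineSet;
  }

  const scalable: number[] = [];
  for (const autoscaler of autoscalers) {
    const machineSet = byName[autoscaler.spec.scaleTargetRef.name];
    if (!machineSet) continue; // autoscaler points at an unknown MachineSet

    // Annotation values are strings; a missing or non-numeric value is skipped.
    const gpus = Number(machineSet.metadata.annotations?.[GPU_ANNOTATION] ?? 0);
    const available = machineSet.status?.availableReplicas ?? 0;
    if (gpus > 0 && available < autoscaler.spec.maxReplicas) {
      scalable.push(gpus);
    }
  }
  return scalable;
}
```

Returning one entry per MachineSet, rather than a sum, keeps the per-node counts separate for whatever the frontend does with them.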

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress This PR is in WIP state label Sep 19, 2022
@maroroman maroroman self-assigned this Sep 19, 2022
@openshift-ci
Contributor

openshift-ci bot commented Sep 19, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-merge-robot openshift-merge-robot added the needs-rebase PR needs to be rebased label Sep 19, 2022
@openshift-merge-robot openshift-merge-robot removed the needs-rebase PR needs to be rebased label Sep 19, 2022
@maroroman
Contributor Author

@andrewballantyne Should I add the autoscaler value to the available gpus and let the dropdown just show a higher number without a UX difference?

@andrewballantyne
Member

@maroroman see Jeff's comment here, I think the answer to your question is this sentence:

The number in the drop down should be the max available on any single node

No UX difference, just show the numbers.
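
One way to read that rule, as a rough sketch only: take the largest GPU count found on any single node, whether the node is currently available or only reachable through autoscaling, and build the dropdown from it. The helper names `maxSelectableGPUs` and `gpuOptions` are hypothetical, and starting the options at 0 is an assumption, not something stated in this thread.

```typescript
// Largest GPU count on any single node, across currently available nodes
// and nodes that autoscaling could add. Returns 0 when both lists are empty.
function maxSelectableGPUs(
  availablePerNode: number[],
  scalablePerNode: number[],
): number {
  return availablePerNode
    .concat(scalablePerNode)
    .reduce((max, count) => Math.max(max, count), 0);
}

// Dropdown values from 0 up to and including max (assumed option layout).
function gpuOptions(max: number): number[] {
  const options: number[] = [];
  for (let i = 0; i <= max; i++) {
    options.push(i);
  }
  return options;
}
```

Because the result is a maximum and not a sum, two nodes with 2 GPUs each still yield a dropdown that stops at 2.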

@maroroman maroroman marked this pull request as ready for review September 30, 2022 13:46
@maroroman maroroman requested review from andrewballantyne, cfchase and lucferbux and removed request for DaoDaoNoCode and LaVLaS September 30, 2022 13:46
@maroroman
Contributor Author

FYI: Added one more force push to deal with a log and fix the role yaml.

@openshift-ci openshift-ci bot removed the approved label Oct 5, 2022
@andrewballantyne
Member

Added some data caching on the backend and removed the re-fetching toggle on the frontend.

@cfchase
Member

cfchase commented Oct 5, 2022

@maroroman we've made some progress here, but please test with a built image:

  1. Can schedule a multi-GPU notebook when a multi-GPU node is scaled down to 0 and there are no other GPU nodes; this should trigger a scaling event.
  2. Can start a multi-GPU notebook when the available node only has 1 GPU available.
  3. Can start a single-GPU notebook when 1 GPU is available, without scaling up.

@cfchase
Member

cfchase commented Oct 6, 2022

Some comments on existing code:

There's another await in a for loop that you need to convert to Promise.all:
https://github.com/opendatahub-io/odh-dashboard/pull/573/files#diff-fe048868b84dce849bd698d4f237f445ecc70661f9ca1842d9a9429b4c2c37feR50

If that request takes 3 seconds and there are 10 nodes, the user will wait 30 seconds for the dialog to get a GPU value. The wait grows O(n) as nodes are added to the cluster. I think this is affecting all current users. Unless a request needs a previous result, you shouldn't wait before executing it.

I think you also have to take out the await here: https://github.com/opendatahub-io/odh-dashboard/pull/573/files#diff-fe048868b84dce849bd698d4f237f445ecc70661f9ca1842d9a9429b4c2c37feR78
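
To illustrate the difference, here is a minimal sketch, not the code under review. `FetchGPUs` is a hypothetical stand-in for whatever per-node request the loop awaits, and both function names are made up for the example.

```typescript
// Hypothetical per-node query; stands in for the request awaited in the loop.
type FetchGPUs = (node: string) => Promise<number>;

// Sequential: each request starts only after the previous one has resolved,
// so total latency is roughly the sum of all request times.
async function sumSequential(nodes: string[], fetchGPUs: FetchGPUs): Promise<number> {
  let total = 0;
  for (const node of nodes) {
    total += await fetchGPUs(node);
  }
  return total;
}

// Concurrent: every request is started up front and awaited together,
// so total latency is roughly that of the slowest single request.
async function sumConcurrent(nodes: string[], fetchGPUs: FetchGPUs): Promise<number> {
  const counts = await Promise.all(nodes.map((node) => fetchGPUs(node)));
  return counts.reduce((sum, count) => sum + count, 0);
}
```

Note that Promise.all rejects as soon as any one request fails, so per-request error handling belongs inside the mapped function if partial results are acceptable.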

This token should be read once at startup and stored on fastify.kube.saToken so you don't have to fetch it on every request:
https://github.com/opendatahub-io/odh-dashboard/pull/573/files#diff-fe048868b84dce849bd698d4f237f445ecc70661f9ca1842d9a9429b4c2c37feR40
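
The read-once pattern could look something like the sketch below. It is framework-agnostic and purely illustrative: `KubeContext`, `createKubeContext`, and `authHeader` are hypothetical names, and `readToken` stands in for however the service account token is actually obtained (for example, reading the token file mounted into the pod).

```typescript
// Hypothetical context object; in the dashboard this role is played by
// the kube object attached to the fastify instance.
type KubeContext = { saToken: string };

// Called once at startup. readToken is an assumed loader supplied by the
// caller, so the expensive lookup happens a single time.
function createKubeContext(readToken: () => string): KubeContext {
  return { saToken: readToken() };
}

// Request handlers reuse the cached token and never re-read it.
function authHeader(kube: KubeContext): { Authorization: string } {
  return { Authorization: 'Bearer ' + kube.saToken };
}
```

The trade-off is that a token cached this way is not refreshed if it rotates while the process is running, which may or may not matter for this deployment.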

@openshift-merge-robot openshift-merge-robot added the needs-rebase PR needs to be rebased label Oct 9, 2022
@maroroman maroroman mentioned this pull request Oct 10, 2022
3 tasks
Member

@cfchase cfchase left a comment

/lgtm

@openshift-ci openshift-ci bot removed the lgtm label Oct 12, 2022
@openshift-merge-robot openshift-merge-robot removed the needs-rebase PR needs to be rebased label Oct 12, 2022
Member

@andrewballantyne andrewballantyne left a comment

/unhold

Forgot to unhold this after Chris approved it. Doing so now.

@openshift-ci openshift-ci bot added lgtm and removed do-not-merge/hold This PR is hold for some reason labels Oct 19, 2022
@openshift-ci
Contributor

openshift-ci bot commented Oct 19, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewballantyne, cfchase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 2b3cdfd into opendatahub-io:main Oct 19, 2022
@maroroman maroroman mentioned this pull request Oct 24, 2022
3 tasks
strangiato pushed a commit to strangiato/odh-dashboard that referenced this pull request Oct 18, 2023
* Add autoscale check

* Clean after rebase

* Add scalable gpus to dropdown count

* Change code to use MachineAutoscalers

* Use machineautoscalers, add logic to frontend

* check available machine replicas in max scalable calculation

* Add role change and fix availability check

* Fix machineSet get logic

* Update frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx

Use the value from the reduce

* Update backend/src/routes/api/gpu/gpuUtils.ts

* Update backend/src/routes/api/gpu/gpuUtils.ts

* Update frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx

* add saToken to kube, optimise gpu data getting

Co-authored-by: Andrew Ballantyne <8126518+andrewballantyne@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

[Feature Request]: GPUs for Auto Scaling
4 participants