Add autoscale check for GPU #573

maroroman · 2022-09-19T14:43:54Z

Resolves #407

Description

This PR adds logic to check the MachineAutoscaler CRs for the maximum number of scalable GPU nodes.

How Has This Been Tested?

Testing was done on an OSD cluster with a GPU node that has autoscaling enabled.

Use the following dashboard image: quay.io/mroman_redhat/odh-dashboard:dev-gpu-3
The dashboard detects all MachineAutoscaler CRs and loops through their MachineSets, which carry the machine.openshift.io/GPU value. That value should contain the true GPU count (correct me if my assumption is wrong).
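
A minimal sketch of the check described above, under several assumptions: the GPU count is read from a `machine.openshift.io/GPU` annotation on the MachineSet, each MachineAutoscaler points at its MachineSet through `spec.scaleTargetRef`, and a MachineSet only counts while it can still add machines. The trimmed type shapes and the function name `maxScalableGpus` are hypothetical and are not taken from this PR's code.

```typescript
// Hypothetical, trimmed-down shapes: only the fields this sketch reads.
type MachineAutoscaler = {
  spec: {
    maxReplicas: number;
    scaleTargetRef: { kind: string; name: string };
  };
};

type MachineSet = {
  metadata: { name: string; annotations?: Record<string, string> };
  status?: { availableReplicas?: number };
};

// Assumption: the GPU count per machine is stored as an annotation.
const GPU_ANNOTATION = "machine.openshift.io/GPU";

// Returns the largest per-machine GPU count among MachineSets that an
// autoscaler can still grow; 0 when nothing can be scaled up.
function maxScalableGpus(
  autoscalers: MachineAutoscaler[],
  machineSets: MachineSet[],
): number {
  let max = 0;
  for (const autoscaler of autoscalers) {
    const target = autoscaler.spec.scaleTargetRef;
    if (target.kind !== "MachineSet") continue;

    const machineSet = machineSets.find((ms) => ms.metadata.name === target.name);
    if (!machineSet) continue;

    const annotations = machineSet.metadata.annotations ?? {};
    const gpusPerMachine = Number(annotations[GPU_ANNOTATION] ?? "0");
    const available = machineSet.status?.availableReplicas ?? 0;

    // Assumption: a MachineSet already at maxReplicas cannot add GPUs.
    if (gpusPerMachine > 0 && available < autoscaler.spec.maxReplicas) {
      max = Math.max(max, gpusPerMachine);
    }
  }
  return max;
}
```

Taking the maximum rather than the sum reflects that a notebook pod lands on a single machine, so only the largest machine that can still be added matters.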

Resulting backend response was this:

The frontend uses this value to extend the dropdown when the number of scalable GPUs exceeds the number of GPUs reported by the Prometheus metrics.
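
As a rough illustration of that frontend behaviour, the dropdown range can be derived from the larger of the two numbers. This is only a sketch: the function name `gpuDropdownOptions` is hypothetical, and starting the options at 0 is an assumption rather than something stated in this PR.

```typescript
// Builds the selectable GPU counts, from 0 up to the larger of the two inputs.
// availableGpus: count reported by the Prometheus metrics.
// scalableGpus: count reachable through autoscaling.
function gpuDropdownOptions(availableGpus: number, scalableGpus: number): number[] {
  const max = Math.max(availableGpus, scalableGpus, 0);
  const options: number[] = [];
  for (let count = 0; count <= max; count++) {
    options.push(count);
  }
  return options;
}
```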

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work

openshift-ci · 2022-09-19T14:44:00Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

maroroman · 2022-09-20T10:32:52Z

@andrewballantyne Should I add the autoscaler value to the available gpus and let the dropdown just show a higher number without a UX difference?

andrewballantyne · 2022-09-20T15:29:17Z

@maroroman see Jeff's comment here; I think the answer to your question is this sentence:

The number in the drop down should be the max available on any single node

No UX difference, just show the numbers.
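
A small sketch of what "max available on any single node" means in code, assuming a hypothetical per-node shape `NodeGpuInfo`; the real data structure in the dashboard may differ.

```typescript
// Hypothetical per-node summary; not the dashboard's actual type.
type NodeGpuInfo = { name: string; available: number };

// The dropdown limit is the largest count on one node, not the cluster total,
// because a single notebook pod cannot span several nodes.
function maxGpusOnSingleNode(nodes: NodeGpuInfo[]): number {
  return nodes.reduce((max, node) => Math.max(max, node.available), 0);
}
```

With two nodes holding 1 and 2 free GPUs, this yields 2 rather than 3.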

maroroman · 2022-10-05T17:58:47Z

FYI: Added one more force push to deal with a log and fix the role yaml.

andrewballantyne · 2022-10-05T19:38:16Z

Added some data caching on the backend and removed the re-fetching toggle on the frontend.

cfchase · 2022-10-05T20:30:22Z

@maroroman we've made some progress here, but please test with a built image:

Can schedule a multi-GPU notebook when a multi-GPU node is scaled down to 0 and there are no other GPU nodes; scheduling it should trigger a scaling event.
Can start a multi-gpu notebook when the available node only has 1 gpu available
Can start a single gpu notebook when 1 is available without scaling up

cfchase · 2022-10-06T20:49:55Z

Some comments on existing code:

There's another await in a for loop you need to Promise.all:
https://github.com/opendatahub-io/odh-dashboard/pull/573/files#diff-fe048868b84dce849bd698d4f237f445ecc70661f9ca1842d9a9429b4c2c37feR50

If that request takes 3 seconds and there are 10 nodes, the user will wait 30 seconds for the dialog to get a GPU value. The wait grows O(n) with the number of nodes in the cluster. I think this is affecting all current users. Unless a request needs a previous result, you shouldn't wait for it before starting the next one.
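
To make the difference concrete, here is a sketch of both patterns. `GpuQuery` is a hypothetical stand-in for the per-node request and is not the function used in this PR.

```typescript
// Hypothetical per-node request; stands in for the real query.
type GpuQuery = (nodeName: string) => Promise<number>;

// Sequential: each request starts only after the previous one finishes,
// so total latency is roughly the sum of all request times.
async function sumGpusSequential(nodes: string[], query: GpuQuery): Promise<number> {
  let total = 0;
  for (const node of nodes) {
    total += await query(node);
  }
  return total;
}

// Concurrent: all requests start right away, so total latency is roughly
// that of the slowest single request.
async function sumGpusConcurrent(nodes: string[], query: GpuQuery): Promise<number> {
  const counts = await Promise.all(nodes.map((node) => query(node)));
  return counts.reduce((sum, count) => sum + count, 0);
}
```

Note that `Promise.all` rejects as soon as any one request fails, so a caller that wants partial results would need to handle errors per request.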

I think you also have to take out the await here: https://github.com/opendatahub-io/odh-dashboard/pull/573/files#diff-fe048868b84dce849bd698d4f237f445ecc70661f9ca1842d9a9429b4c2c37feR78

This token should be read at startup and added to fastify.kube.saToken so you don't have to fetch it on every request:
https://github.com/opendatahub-io/odh-dashboard/pull/573/files#diff-fe048868b84dce849bd698d4f237f445ecc70661f9ca1842d9a9429b4c2c37feR40
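
One way to get the "load once, reuse everywhere" behaviour is a small cache around the loader. This sketch is an assumption-laden illustration: `createTokenCache` and the injected loader are hypothetical, and the actual wiring into `fastify.kube.saToken` is not shown.

```typescript
// Hypothetical loader type, e.g. a function that reads the service account token.
type TokenLoader = () => Promise<string>;

// Wraps a loader so the token is fetched at most once and then reused.
// A failed load is not cached, so a later call can retry.
function createTokenCache(load: TokenLoader): () => Promise<string> {
  let cached: Promise<string> | undefined;
  return () => {
    if (cached === undefined) {
      cached = load().catch((err) => {
        cached = undefined;
        throw err;
      });
    }
    return cached;
  };
}
```

Calling the returned function during startup warms the cache, and request handlers then receive the stored value without another fetch.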

cfchase

/lgtm

andrewballantyne

/unhold

Forgot to unhold this after Chris approved it. Doing so now.

openshift-ci · 2022-10-19T14:55:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewballantyne, cfchase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andrewballantyne]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* Add autoscale check
* Clean after rebase
* Add scalable gpus to dropdown count
* Change code to use MachineAutoscalers
* Use machineautoscalers, add logic to frontend
* check available machine replicas in max scalable calculation
* Add role change and fix availability check
* Fix machineSet get logic
* Update frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx Use the value from the reduce
* Update backend/src/routes/api/gpu/gpuUtils.ts
* Update backend/src/routes/api/gpu/gpuUtils.ts
* Update frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx
* add saToken to kube, optimise gpu data getting

Co-authored-by: Andrew Ballantyne <8126518+andrewballantyne@users.noreply.github.com>

Add autoscale check

f95f89c

openshift-ci bot added the do-not-merge/work-in-progress This PR is in WIP state label Sep 19, 2022

maroroman self-assigned this Sep 19, 2022

openshift-merge-robot added the needs-rebase PR needs to be rebased label Sep 19, 2022

Merge branch 'main' of https://github.com/opendatahub-io/odh-dashboard …

e5d8028

…into gpu-autoscale

openshift-merge-robot removed the needs-rebase PR needs to be rebased label Sep 19, 2022

Clean after rebase

8a1735a

maroroman requested review from DaoDaoNoCode, andrewballantyne and mlassak September 20, 2022 10:31

Add scalable gpus to dropdown count

05e486c

maroroman removed request for andrewballantyne, DaoDaoNoCode and mlassak September 27, 2022 14:30

maroroman added 2 commits September 27, 2022 16:31

Change code to use MachineAutoscalers

e9596d9

Use machineautoscalers, add logic to frontend

067ce2f

maroroman force-pushed the gpu-autoscale branch from 1450cec to 067ce2f Compare September 30, 2022 13:39

maroroman marked this pull request as ready for review September 30, 2022 13:46

openshift-ci bot requested review from DaoDaoNoCode and LaVLaS September 30, 2022 13:46

maroroman requested review from andrewballantyne, cfchase and lucferbux and removed request for DaoDaoNoCode and LaVLaS September 30, 2022 13:46

andrewballantyne reviewed Sep 30, 2022

frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx (outdated)

maroroman force-pushed the gpu-autoscale branch from 044c57f to aadae4d Compare October 4, 2022 12:13

andrewballantyne reviewed Oct 5, 2022

frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx (outdated)

Update frontend/src/pages/notebookController/screens/server/GPUSelect…

9c75dc9

…Field.tsx Use the value from the reduce

andrewballantyne requested changes Oct 5, 2022

backend/src/routes/api/gpu/gpuUtils.ts

backend/src/routes/api/gpu/gpuUtils.ts (outdated)

frontend/src/pages/notebookController/screens/server/GPUSelectField.tsx (outdated)

openshift-ci bot removed the approved label Oct 5, 2022

andrewballantyne added 3 commits October 5, 2022 15:37

Update backend/src/routes/api/gpu/gpuUtils.ts

f44bbea

Update backend/src/routes/api/gpu/gpuUtils.ts

b4e578e

Update frontend/src/pages/notebookController/screens/server/GPUSelect…

8e33a81

…Field.tsx

andrewballantyne mentioned this pull request Oct 6, 2022

[Bug]: GPUs cannot be used with Tainted Nodes #633

Closed

1 task

add saToken to kube, optimise gpu data getting

63120b7

openshift-merge-robot added the needs-rebase PR needs to be rebased label Oct 9, 2022

maroroman mentioned this pull request Oct 10, 2022

Add gpu call optimisations #644

Closed

3 tasks

cfchase approved these changes Oct 12, 2022

openshift-ci bot assigned cfchase Oct 12, 2022

openshift-ci bot added the lgtm label Oct 12, 2022

Merge branch 'main' of https://github.com/opendatahub-io/odh-dashboard …

582adb8

…into gpu-autoscale

openshift-ci bot removed the lgtm label Oct 12, 2022

openshift-merge-robot removed the needs-rebase PR needs to be rebased label Oct 12, 2022

andrewballantyne approved these changes Oct 19, 2022

openshift-ci bot added lgtm and removed do-not-merge/hold This PR is hold for some reason labels Oct 19, 2022

openshift-ci bot added the approved label Oct 19, 2022

openshift-merge-robot merged commit 2b3cdfd into opendatahub-io:main Oct 19, 2022

andrewballantyne mentioned this pull request Oct 21, 2022

Use the kube config to get the current logged in token #686

Merged

maroroman mentioned this pull request Oct 24, 2022

Fix gpu fileread #691

Closed

3 tasks
