Habana to `main` #1884

andrewballantyne · 2023-09-29T15:57:55Z

Closes: #1450

Description

Merging Habana Feature into main.

How Has This Been Tested?

How I am testing Habana

Can create an accelerator profile and see it in all UI locations (workbench, Jupyter spawner, model serving), in both add and edit mode
Model server details show the correct accelerator name and number of accelerators
Resources and tolerations are successfully added and removed for new and edited workbenches and serving runtimes
Existing settings keep the accelerator settings for both runtimes and workbenches
Auto-detection of NVIDIA GPUs works when the tolerations and resource identifier match
Setting the accelerator to none removes the identifier from the resources and removes the tolerations as well
- In the case of existing settings, tolerations are not removed because we don’t know which ones to remove
Workbenches and serving runtimes show that images/runtimes/accelerators are recommended if the identifier and annotation match
Error and warning cases are correct based on the recommended annotation
Can update the count without changing the accelerator
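For reviewers who want a concrete picture of the add/remove behaviour checked above, here is a minimal sketch. It is not the dashboard's implementation: the `AcceleratorProfile` and `PodFragment` shapes and the `applyAccelerator` helper are hypothetical names made up for illustration, and the "existing settings" case is modelled, as an assumption, as a profile with a known identifier and an empty toleration list.

```typescript
// Hypothetical shapes for illustration only; not the dashboard's actual types.
type Toleration = { key: string; operator?: string; effect?: string; value?: string };

type AcceleratorProfile = {
  identifier: string; // resource name, e.g. "nvidia.com/gpu"
  tolerations: Toleration[]; // empty for "existing settings" (assumption)
};

type PodFragment = {
  limits: Record<string, string | number>;
  requests: Record<string, string | number>;
  tolerations: Toleration[];
};

const sameToleration = (a: Toleration, b: Toleration): boolean =>
  a.key === b.key && a.operator === b.operator && a.effect === b.effect && a.value === b.value;

// Returns a new fragment with the previous accelerator removed and the next one applied.
// Passing `next` as undefined corresponds to selecting "none".
function applyAccelerator(
  pod: PodFragment,
  previous: AcceleratorProfile | undefined,
  next: AcceleratorProfile | undefined,
  count: number,
): PodFragment {
  const limits = { ...pod.limits };
  const requests = { ...pod.requests };
  let tolerations = [...pod.tolerations];

  if (previous) {
    // Drop the old resource identifier.
    delete limits[previous.identifier];
    delete requests[previous.identifier];
    // Only tolerations known to come from the previous profile are removed.
    // With an empty list (existing settings) nothing is removed.
    tolerations = tolerations.filter(
      (t) => !previous.tolerations.some((p) => sameToleration(t, p)),
    );
  }

  if (next) {
    // Never write a count below 1 for a selected accelerator.
    const safeCount = Number.isFinite(count) ? Math.max(1, Math.floor(count)) : 1;
    limits[next.identifier] = safeCount;
    requests[next.identifier] = safeCount;
    for (const t of next.tolerations) {
      if (!tolerations.some((e) => sameToleration(e, t))) {
        tolerations.push(t);
      }
    }
  }

  return { limits, requests, tolerations };
}
```

Calling `applyAccelerator(pod, profile, profile, newCount)` covers the "update the count without changing the accelerator" case: the identifier is rewritten with the new count and the toleration list ends up unchanged.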

NOTE
Migration and accelerator detection cannot be tested without a GPU cluster

Test Impact

Tests will be coming after the feature is merged

Request review criteria:

Self checklist (all need to be checked):

The developer has manually tested the changes and verified that the changes work
Commits have been squashed into descriptive, self-contained units of work (e.g. 'WIP' and 'Implements feedback' style messages have been removed)
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has added tests or explained why testing cannot be added (unit tests & storybook for related changes)

If you have UI changes:

Included any necessary screenshots or gifs if it was a UI change.
Included tags to the UX team if it was a UI/UX change (find relevant UX in the SMEs section).

After the PR is posted & before it merges:

The developer has tested their solution on a cluster by using the image produced by the PR to main

* added cr
* added accelerator profile crd
* added to kustomize

add copy to clipboard icon to tooltips

fixed detected accelerator count
connected accelerator detection
added accelerator UI user flow
hide accelerator dropdown when empty
switched the format of the notebook identifier
added accelerator name to serving runtime resource
added serving runtimes accelerators

commit 9387956
Author: Gage Krumbach <gkrumbach@gmail.com>
Date: Fri Jun 30 14:56:37 2023 -0500

added accelerator UI user flow
fixed detected accelerator count
connected accelerator detection
added accelerator UI user flow
hide accelerator dropdown when empty
switched the format of the notebook identifier
added accelerator name to serving runtime resource
added serving runtimes accelerators

commit 26da289
Author: Gage Krumbach <gkrumbach@gmail.com>
Date: Tue Aug 1 16:40:25 2023 -0500

fix error state in migration

commit 391cbca
Author: Gage Krumbach <gkrumbach@gmail.com>
Date: Tue Aug 1 15:09:25 2023 -0500

added accelerator detection line

commit 50839ac
Author: Gage Krumbach <gkrumbach@gmail.com>
Date: Thu Jul 27 13:52:24 2023 -0500

added gpu migration

Accelerator user flow

added gpu migration

added accelerator detection

bug fixes

fix lint errors in accelerator support

update deployed notebooks and sr on migrate
fixed error logging
remove container migration
Added support for "keep what i have"
soft migrate nvidia gpus to profiles
fix handle exisiting settings
refactored hooks
remove useRef
simplify functions
small changes to hook
merge hooks together
update cluster role
small changes
bug fixes
small type fix
fixed type issues

Fix bug in migration for GPUS

revert add rbac accelerator role

add rbac accelerator role

making image/servingruntime naming dynamic
fix count going back to 0
prevent 0 count and ux style fix
removed usage of unknown when not needed
remove double array usage
improved backend logging
fix logging undefined error
make ?? consistent
fixed "||" and fixed unknown / none details

Minor accelerator fixes

fix accelerator detection logic

update cluster role to allow accelerator profile creation

move from cluster role to role for accelerator create

Gkrumbach07 · 2023-09-29T17:24:01Z

@andrewballantyne added testing instructions

Gkrumbach07 · 2023-09-29T17:38:32Z

/lgtm

andrewballantyne · 2023-09-29T18:03:28Z

/approve

Going ahead with the merge. We'll have another week before code freeze / release, so we can make small adjustments along the way. QE has been verifying the feature for the past couple of weeks.

openshift-ci · 2023-09-29T18:03:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewballantyne

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andrewballantyne]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Gkrumbach07 and others added 30 commits July 3, 2023 08:15

Added accelerator CRD (#1451)

ff64ff2

* added cr
* added accelerator profile crd
* added to kustomize

add copy to clipboard to k8 name popover

a6c7798

Merge pull request #1561 from Gkrumbach07/copy-tooltip

4b2f50b

add copy to clipboard icon to tooltips

added gpu migration

50839ac

added accelerator detection

84c2231

added accelerator detection line

391cbca

fix error state in migration

26da289

added more resource types

ab07f22

sqush

34a2f1c

Merge branch 'accelerator-support' into accelerator-cr

7f0b159

Merge pull request #1555 from Gkrumbach07/accelerator-cr

5509fa9

Accelerator user flow

Merge pull request #1618 from Gkrumbach07/migration

1a1da24

added gpu migration

Merge pull request #1628 from Gkrumbach07/accelerator-detection

dba676e

added accelerator detection

bug fixes

e5717c3

update wording

607fe26

Merge pull request #1645 from Gkrumbach07/accelerator-support

55d3089

bug fixes

Merge branch 'main' into accelerator-support

fc89a4e

fix lint errors

c8f2767

Merge pull request #1668 from Gkrumbach07/accelerator-support

e555aaf

fix lint errors in accelerator support

Merge pull request #1677 from Gkrumbach07/accelerator-support

2e152ee

Fix bug in migration for GPUS

add rbac accelerator role

a4bb172

revert add rbac accelerator role

2a300bc

add rbac accelerator role

84ca02e

Merge pull request #1753 from Gkrumbach07/revert-commit

1afaee2

revert add rbac accelerator role

Merge pull request #1754 from Gkrumbach07/add-roles

1fe5489

add rbac accelerator role

openshift-merge-robot and others added 8 commits September 8, 2023 19:51

Merge pull request #1764 from Gkrumbach07/minor-fixes

dcd23ec

Minor accelerator fixes

fix detection logic

e529890

Merge pull request #1865 from Gkrumbach07/fix-detection

49bfc75

fix accelerator detection logic

update cluster role

548d781

Merge pull request #1877 from Gkrumbach07/update-service-role

f00119e

update cluster role to allow accelerator profile creation

move from cluster role to role

ab90061

Merge pull request #1879 from Gkrumbach07/update-service-role

1216427

move from cluster role to role for accelerator create

Merge branch 'main' into f/accelerator-support

de01574

openshift-ci bot requested review from ppadti and uidoyen September 29, 2023 15:58

andrewballantyne changed the title ~~F/accelerator support~~ Habana to main Sep 29, 2023

andrewballantyne requested a review from Gkrumbach07 September 29, 2023 16:13

andrewballantyne assigned Gkrumbach07 Sep 29, 2023

openshift-ci bot added the lgtm label Sep 29, 2023

openshift-ci bot added the approved label Sep 29, 2023

openshift-merge-robot merged commit 284aa98 into main Sep 29, 2023
6 checks passed

andrewballantyne deleted the f/accelerator-support branch November 23, 2023 20:38

andrewballantyne commented Sep 29, 2023 • edited by Gkrumbach07