[Feature Request]: Move tolerationSettings from notebooks generally to data science projects #1306

shalberd · 2023-05-29T19:26:46Z

Feature description

Currently, the notebook toleration settings in the ODH Dashboard Config apply to all notebooks in all namespaces.

Assume we have a cluster with different dedicated nodes per customer:

nodes A-B (worker nodes) tainted NoExecute, Equal, key: customer, value: customer1
nodes C-D (worker nodes) tainted NoExecute, Equal, key: customer, value: customer2

The idea is to have namespaces per customer. This can be one namespace per user, a concept I have grown used to, but there needs to be a way to ensure that user / workbench namespaces can belong to different customers and get different scheduling placements, that is, control over which nodes their pods land on.

So, my suggestion would be to

move notebookTolerationSettings from the ODH Dashboard Config, where it is a global setting for all notebooks in all namespaces, to tolerationSettings on specific Data Science Projects, that is, make it namespace / project-specific
change the effect from NoSchedule to NoExecute, so that existing pods on the node that do not tolerate the taint are evicted and moved to an untainted node
change the operator from Exists to Equal. Exists is fine for node taint keys like nvidia.com/gpu, where the value does not matter, I presume. But it is not enough for tolerations where key AND value must match, as in the scenario described above. Just matching key: customer would not be enough.
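To illustrate why Exists is too broad for the per-customer scenario, here is a minimal Python sketch of how a toleration is matched against a taint. It follows the matching rules in the Kubernetes documentation as I understand them; the function name `tolerates` and the dictionary layout are illustrative only and are not taken from the dashboard code.

```python
from typing import Dict

Taint = Dict[str, str]
Toleration = Dict[str, str]


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (sketch of Kubernetes semantics)."""
    # An empty or missing effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    # An empty or missing key matches every taint key.
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    # Kubernetes treats a missing operator as "Equal".
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True  # the taint value is ignored
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


# Taints from the scenario above.
customer1_taint = {"key": "customer", "value": "customer1", "effect": "NoExecute"}
customer2_taint = {"key": "customer", "value": "customer2", "effect": "NoExecute"}

# Key-only toleration, in the style of the current global setting.
exists_toleration = {"key": "customer", "operator": "Exists", "effect": "NoExecute"}
# Key-and-value toleration, as proposed.
equal_toleration = {
    "key": "customer",
    "operator": "Equal",
    "value": "customer1",
    "effect": "NoExecute",
}
```

With these definitions, `exists_toleration` tolerates both customers' taints, so a customer1 notebook could land on customer2 nodes, while `equal_toleration` tolerates only the customer1 taint.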

Describe alternatives you've considered

For now, we do not have multiple customers with data science project namespaces grouped per customer, so we schedule all notebooks on nodes with a given node taint key, e.g. key: opendatahub, using the existing mechanism in OdhDashboardConfig.

But going forward, moving from cluster-wide to namespace-specific configuration will become important, be it for tolerations or for things like linking all service accounts, including the dynamically created ones for notebooks in data science projects, to an image pull secret.
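As a rough sketch of what namespace-specific resolution could look like, the Python below looks up tolerations per project and falls back to the global setting when a project has none. Everything here is hypothetical: the function `tolerations_for_namespace`, the namespace names, and the data layout are made up for illustration and do not reflect the actual OdhDashboardConfig schema or dashboard implementation.

```python
from typing import Dict, List, Optional

Toleration = Dict[str, str]

# Hypothetical stand-in for the current global notebook toleration
# (key-only match, NoSchedule effect, as described in this issue).
GLOBAL_TOLERATION: Toleration = {
    "key": "opendatahub",
    "operator": "Exists",
    "effect": "NoSchedule",
}

# Hypothetical per-project settings, keyed by namespace.
PROJECT_TOLERATIONS: Dict[str, List[Toleration]] = {
    "customer1-alice": [
        {"key": "customer", "operator": "Equal", "value": "customer1", "effect": "NoExecute"}
    ],
    "customer2-bob": [
        {"key": "customer", "operator": "Equal", "value": "customer2", "effect": "NoExecute"}
    ],
}


def tolerations_for_namespace(
    namespace: str,
    project_tolerations: Dict[str, List[Toleration]],
    global_toleration: Optional[Toleration] = None,
) -> List[Toleration]:
    """Prefer project-specific tolerations; otherwise fall back to the global one."""
    if namespace in project_tolerations:
        # Copy so callers cannot mutate the stored settings.
        return [dict(t) for t in project_tolerations[namespace]]
    return [dict(global_toleration)] if global_toleration else []
```

A notebook in `customer1-alice` would then get only the customer1 toleration, while a notebook in a project without its own settings would keep today's global behavior.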

Anything else?

No response

Gkrumbach07 · 2023-05-31T14:17:11Z

cc @andrewballantyne

bdattoma · 2023-11-23T13:48:37Z

Could this be applied to models as well? Maybe we could have a set of tolerations to allow models to be served on GPU nodes which are dedicated to serving by means of taints.

andrewballantyne · 2023-11-23T17:37:54Z

Could this be applied to models as well? Maybe we could have a set of tolerations to allow models to be served on GPU nodes which are dedicated to serving by means of taints.

This is no longer the case now that we have AcceleratorProfiles, which I think shipped in RHOAI 1.33 or 2.4. Tolerations for GPU usage, so that you can use taints effectively, are already covered @bdattoma

This request is for allowing more flexibility in general tolerations for Notebooks (and, I imagine, for DS Project resources as a whole, unrelated to GPUs or Accelerators).

andrewballantyne · 2023-11-23T17:39:41Z

I think this predates the UX flow. Moving to UX.

UX Context

I think we need to design a way to bring the NotebookTolerations cluster settings to the project so the user can manage their resources against tolerations. This may be more feasible with the added state in the admin view of Habana part 2 and the toleration modal. #1255

bdattoma · 2023-11-27T10:09:07Z

This is no longer the case now that we have AcceleratorProfiles, which I think shipped in RHOAI 1.33 or 2.4. Tolerations for GPU usage, so that you can use taints effectively, are already covered @bdattoma

Is it possible to set a custom toleration for the accelerator? If I don't want to use the default nvidia.com/gpu which I think is automatically added when attaching the GPU profile.

andrewballantyne · 2023-11-27T16:47:50Z

This is no longer the case now that we have AcceleratorProfiles, which I think shipped in RHOAI 1.33 or 2.4. Tolerations for GPU usage, so that you can use taints effectively, are already covered @bdattoma

Is it possible to set a custom toleration for the accelerator? If I don't want to use the default nvidia.com/gpu which I think is automatically added when attaching the GPU profile.

@bdattoma Yes it is -- when you create the AcceleratorProfile (or modify the one we create on migration) you can pick whatever tolerations you want and as many as you want. Our old world was a single static toleration, so we migrate with that -- but it is modifiable.

The Admin UI is coming in 2.6 I believe, and is currently in incubation if you want to check it out. The tracker: #1255

shalberd added the kind/enhancement and untriaged labels May 29, 2023

Gkrumbach07 added the feature/ds-projects and priority/normal labels and removed the untriaged label May 31, 2023

Gkrumbach07 added the needs-info label May 31, 2023

andrewballantyne removed the needs-info label Nov 23, 2023

dgutride added the community label Jan 30, 2024
