Enhance Nvidia Device plugin with more health checking features #12

jiayingz · 2017-12-08T00:27:09Z

Quoting what @RenaudWasTaken mentioned in another thread:
"The Nvidia Device plugin has a lot of such features coming up. A few of these are:

- memory scrubbing
- healthCheck and reset in case of bad state
- GPU Allocated memory checks
- "Zombie processes" checks
- ...
"

Creating this issue to track the progress on these improvements.

@RenaudWasTaken could you also provide more details on some of these features, like what GPU Allocated memory checks and "Zombie processes" checks do?

jiayingz · 2017-12-08T00:28:04Z

cc @jiayingz @vishh @mindprince

flx42 · 2017-12-08T00:41:17Z

> what GPU Allocated memory checks and "Zombie processes" checks do?

Those two checks are related to each other, and both are related to reset too. If it's not possible to reset your card, you at least need to detect when things are broken.

When things go awfully wrong, you can have the following:

- There is no process using the GPU, but nvidia-smi shows a non-trivial amount of memory being used, e.g. something like 352MiB / 12181MiB.
- There is a process already using the GPU you are supposed to give to a new container (excluding voluntary sharing). This can happen when there is a GPU fault and the process that had an open CUDA context can't tear down properly.

These kinds of checks are useful safety checks in addition to event-based healthchecks like XIDs and ECCs. Some of these errors could go unnoticed otherwise.
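The two conditions above can be sketched as a pre-allocation check. This is only an illustrative sketch, not how the device plugin implements it: the function names, the 64 MiB threshold, and the idea of feeding it CSV text are all assumptions. The input is assumed to come from something like `nvidia-smi --query-gpu=uuid,memory.used --format=csv,noheader,nounits` and `nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader,nounits`. Voluntary sharing is not modeled.

```python
import csv
import io

# Hypothetical cutoff for "non-trivial" memory on an idle GPU; tune per driver and GPU model.
IDLE_MEMORY_THRESHOLD_MIB = 64


def parse_memory_used(csv_text):
    """Map GPU UUID -> used memory in MiB from 'uuid, memory.used' rows."""
    used = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) >= 2:
            used[row[0].strip()] = int(row[1].strip())
    return used


def parse_compute_apps(csv_text):
    """Map GPU UUID -> list of PIDs from 'gpu_uuid, pid' rows."""
    pids = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) >= 2:
            pids.setdefault(row[0].strip(), []).append(int(row[1].strip()))
    return pids


def pre_allocation_problems(uuid, used_mib, pids_by_gpu,
                            threshold_mib=IDLE_MEMORY_THRESHOLD_MIB):
    """Return a list of reasons why this GPU should not be handed to a new container."""
    problems = []
    pids = pids_by_gpu.get(uuid, [])
    if pids:
        # "Zombie process" case: something still holds the GPU.
        problems.append(f"leftover processes on GPU: {pids}")
    elif used_mib.get(uuid, 0) > threshold_mib:
        # "Allocated memory" case: memory in use but no process owns it.
        problems.append(f"{used_mib[uuid]} MiB in use with no process attached")
    return problems
```

A GPU that reports 352 MiB used with no compute process would be flagged by the second branch, while a GPU with a leftover PID would be flagged by the first.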

ScorpioCPH · 2017-12-13T02:48:04Z

@flx42 Thanks for your detailed explanation, it's very useful for us!

wsxiaozhang · 2019-06-19T11:23:52Z

@flx42 @jiayingz are there any plans to enhance the current device health check?

github-actions · 2024-05-23T04:25:26Z

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

flx42 closed this as completed Dec 8, 2017

flx42 reopened this Dec 8, 2017

nvjmayo added the enhancement label Jul 20, 2020

Meoop added a commit to Meoop/k8s-device-plugin that referenced this issue Dec 8, 2020

Merge pull request NVIDIA#12 from Meoop/helm

fe06c5b

chore(*): helm chart

klueska added feature-request and removed enhancement labels Jan 25, 2024

ArangoGutierrez added the feature label and removed the feature-request label Feb 22, 2024

github-actions bot added the lifecycle/stale label May 23, 2024
