This feature request proposes enhancing the Node Problem Detector (NPD) with the ability to monitor GPUs on nodes and detect GPU issues.
Currently, NPD has no direct visibility into GPUs. However, many workloads are GPU-accelerated, which makes GPU health an important part of node health. For example, GPUs are widely used in machine learning training and inference, especially LLM training, which may use tens of thousands of GPU cards. If any one GPU in the cluster goes bad, the entire training job must be restarted from the previous checkpoint.
This feature request adds the following capabilities:
- GPU device monitoring: NPD will periodically collect GPU device info and look for crashes or errors via nvidia-smi/NVML/DCGM tools (a rough sketch of such a check is shown after this list).
- GPU hang detection: NPD will periodically check GPU device info to detect whether a GPU is "stuck" (e.g. the nvidia-smi command hangs).
- TBD: GPU runtime monitoring: NPD will check for crashes or OOM issues reported in NVIDIA logs.
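
To make the first two bullets more concrete, here is a minimal, hypothetical sketch (in Go, since NPD is written in Go) of the kind of check a gpu_monitor plugin could run. The names (checkGPUHealth, gpuCheckTimeout) and the exact nvidia-smi query fields are illustrative assumptions, not the actual plugin API:

```go
// Illustrative sketch only: the names and query fields below are assumptions,
// not the proposed gpu_monitor plugin's real interface.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// gpuCheckTimeout bounds the nvidia-smi call so a hung GPU/driver is reported
// as a problem instead of blocking the monitor loop indefinitely.
const gpuCheckTimeout = 30 * time.Second

// checkGPUHealth queries basic per-device info via nvidia-smi; a failing or
// timed-out query is treated as a GPU problem. A real plugin would likely use
// NVML/DCGM bindings instead of shelling out.
func checkGPUHealth(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, gpuCheckTimeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "nvidia-smi",
		"--query-gpu=index,name,temperature.gpu",
		"--format=csv,noheader")
	out, err := cmd.CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("nvidia-smi timed out after %v: GPU or driver appears stuck", gpuCheckTimeout)
	}
	if err != nil {
		return fmt.Errorf("nvidia-smi failed: %v, output: %q", err, strings.TrimSpace(string(out)))
	}
	return nil
}

func main() {
	// Run the check periodically, as the proposed plugin would.
	for {
		if err := checkGPUHealth(context.Background()); err != nil {
			fmt.Println("GPU problem detected:", err)
		}
		time.Sleep(60 * time.Second)
	}
}
```

Bounding the call with a context timeout is what lets the monitor distinguish a hung nvidia-smi (the "stuck GPU" case) from an ordinary query failure.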
Specifically, this feature request includes:
- Code for the gpu_monitor plugin
- A Dockerfile to build an NPD image with GPU support
- Other dependencies
Looking forward to your feedback!