Feature Request: GPU Support #833

@ZongqiangZhang

Description

This feature request aims to enhance the Node Problem Detector (NPD) with the ability to monitor GPUs on nodes and detect GPU issues.

Currently, NPD has no direct visibility into GPUs. However, many workloads are GPU-accelerated, which makes GPU health an important part of node health. For example, GPUs are widely used in machine learning training and inference, especially LLM training, which may use tens of thousands of GPU cards; if any one of those GPUs goes bad, the entire training job typically has to be restarted from a previous checkpoint.

This feature request adds the following capabilities:

  • GPU error monitoring: NPD will collect GPU device information periodically and look for crashes or errors via nvidia-smi/NVML/DCGM tools.

  • GPU hang detection: NPD will periodically check GPU devices to detect whether a GPU is "stuck" (e.g. the nvidia-smi command hangs); see the sketch after this list.

  • TBD: GPU runtime monitoring: NPD will check for crashes or OOM issues reported in NVIDIA driver logs.
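
To illustrate the hang-detection idea only, here is a minimal, hypothetical Go sketch (not the proposed gpu_monitor implementation). It assumes nvidia-smi is on the PATH and follows the usual NPD custom-plugin convention of exiting 0 when healthy and non-zero when a problem is detected; the queried fields and the 30s budget are illustrative choices.

```go
// Hypothetical sketch only: run nvidia-smi under a timeout so a hung command
// surfaces as a problem instead of blocking the monitor.
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	// Bound how long nvidia-smi may run; a hang is treated as a GPU problem.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Query a few basic per-GPU fields; the exact fields are illustrative.
	cmd := exec.CommandContext(ctx, "nvidia-smi",
		"--query-gpu=index,uuid,temperature.gpu", "--format=csv,noheader")
	out, err := cmd.Output()

	switch {
	case errors.Is(ctx.Err(), context.DeadlineExceeded):
		// nvidia-smi hung or ran too long: report a stuck GPU.
		fmt.Println("nvidia-smi timed out; GPU may be stuck")
		os.Exit(1)
	case err != nil:
		// nvidia-smi exited with an error (e.g. driver or device failure).
		fmt.Printf("nvidia-smi failed: %v\n", err)
		os.Exit(1)
	default:
		fmt.Printf("GPUs healthy:\n%s", out)
		os.Exit(0)
	}
}
```

The real plugin would likely use NVML/DCGM bindings rather than shelling out, but the timeout-bounded health check is the core idea.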

Specifically, this feature request includes:

  • Code for the gpu_monitor plugin
  • A Dockerfile to build an NPD image with GPU support
  • Other dependencies

Looking forward to your feedback!

Labels: kind/feature, lifecycle/frozen
