test: add testing Nvidia docker container script #89
Merged
lixuemin2016 merged 1 commit into linux-system-roles:main on Mar 3, 2026
Reviewer's Guide
Adds an NVIDIA GPU validation test script and wires it into the Azure HPC role so GPU-enabled instances can be automatically validated via a Docker-based CUDA container check, while skipping GPU tests on non-GPU hosts.
Sequence diagram for NVIDIA Docker GPU validation script execution
sequenceDiagram
actor User
participant TestScript as test_nvidia_docker_sh
participant OS
participant Systemd
participant Docker
participant Containerd
participant GPU
User->>TestScript: Execute test_nvidia_docker_sh
rect rgb(235, 235, 255)
TestScript->>OS: Check moby-engine package
OS-->>TestScript: Installed
TestScript->>OS: Check moby-containerd package
OS-->>TestScript: Installed
TestScript->>OS: Check nvidia-container-toolkit package
OS-->>TestScript: Installed
end
rect rgb(235, 255, 235)
TestScript->>Systemd: Query containerd service status
Systemd-->>TestScript: containerd active
TestScript->>Systemd: Query docker service status
Systemd-->>TestScript: docker active
end
rect rgb(255, 245, 235)
TestScript->>GPU: Run nvidia_smi
alt GPU not present or nvidia_smi fails
GPU-->>TestScript: Error or no device
TestScript-->>User: Log skip GPU access test
else GPU present
GPU-->>TestScript: GPU info
TestScript->>Docker: Check NVIDIA runtime registration
Docker-->>TestScript: NVIDIA runtime available
TestScript->>Docker: Run CUDA container with GPU access
Docker->>GPU: Expose GPU to container
GPU-->>Docker: GPU accessible
Docker-->>TestScript: Container exited successfully
TestScript-->>User: Report GPU access PASS
end
end
Flow diagram for NVIDIA Docker GPU validation logic
flowchart TD
A[Start test_nvidia_docker_sh] --> B[Check moby-engine installed]
B --> C[Check moby-containerd installed]
C --> D[Check nvidia-container-toolkit installed]
D --> E[Check containerd service is active]
E --> F[Check docker service is active]
F --> G[Run nvidia-smi to detect GPU]
G --> H{GPU detected?}
H -- No --> I[Log no GPU detected]
I --> J[Skip GPU access container test]
J --> Z[End]
H -- Yes --> K[Verify NVIDIA runtime is registered in Docker]
K --> L[Run NVIDIA CUDA container with GPU access]
L --> M{Container can access GPU?}
M -- Yes --> N[Report PASS GPU accessible]
M -- No --> O[Report FAIL GPU not accessible]
N --> Z[End]
O --> Z[End]
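The decision logic in the flowchart can be sketched in shell roughly as follows. This is a minimal sketch, not the script from this PR: the function names `gpu_present` and `run_gpu_access_test` are hypothetical, the `[FAIL]` message wording is assumed, and the container check is passed in as a command so that no particular `docker` invocation is implied here. The skip status 77 is taken from the review feedback.

```shell
# Hypothetical sketch of the flowchart; names are illustrative only.

SKIP_RC=77  # skip exit status mentioned in the review feedback

# Succeeds only when nvidia-smi exists and runs without error.
gpu_present() {
    command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1
}

# Usage: run_gpu_access_test <container-check-command> [args...]
# Skips when no GPU is detected; otherwise runs the given check command
# and reports PASS or FAIL based on its exit status.
run_gpu_access_test() {
    if ! gpu_present; then
        echo "[SKIP] No GPU hardware detected - cannot test GPU access"
        return "$SKIP_RC"
    fi
    if "$@"; then
        echo "[PASS] GPU is accessible from Docker container"
        return 0
    fi
    echo "[FAIL] GPU is not accessible from Docker container"
    return 1
}
```

Passing the check as a command keeps the skip decision separate from the Docker-specific step, which makes the branch logic easy to exercise on hosts without a GPU.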
Hey - I've left some high-level feedback:
- Consider making the `NVIDIA_IMAGE` and possibly the expected Docker runtime name configurable via environment variables or script flags so the test can be reused with different CUDA images or runtime configurations without editing the script.
- The `test_nvidia_gpu_access` function exits with code 77 on skip; double-check that this code is correctly interpreted as a skipped test by your surrounding harness, or align it with whatever skip convention the rest of the test suite uses.
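One way to act on the first suggestion is to read both values from the environment and fall back to defaults. The sketch below is an assumption about how that could look: `NVIDIA_RUNTIME`, `load_config`, and both default values are hypothetical (the image default is a deliberate placeholder, not the image used by the script).

```shell
# Hypothetical defaults; the script's real values are not shown on this page.
DEFAULT_NVIDIA_IMAGE="example.invalid/cuda:placeholder"
DEFAULT_NVIDIA_RUNTIME="nvidia"

# Keep any value already set in the environment, otherwise use the default.
load_config() {
    NVIDIA_IMAGE=${NVIDIA_IMAGE:-$DEFAULT_NVIDIA_IMAGE}
    NVIDIA_RUNTIME=${NVIDIA_RUNTIME:-$DEFAULT_NVIDIA_RUNTIME}
}
```

With this in place, a caller could override the image for a single run, for example `NVIDIA_IMAGE=<your CUDA image> bash test-nvidia-docker.sh`, without editing the script.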
spetrosi approved these changes on Mar 3, 2026
Add Nvidia Docker container GPU access validation as below:
- Add test packages installation for moby-engine, moby-containerd and nvidia-container-toolkit
- Validate that containerd and docker services are active
- Detect GPU hardware presence before running GPU container test
- Skip GPU container test if no GPU is detected from VM
- Run NVIDIA Docker container to verify GPU access

JIRA: RHELHPC-120
Signed-off-by: Xuemin Li <xuli@redhat.com>
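The package and service checks in the commit message could be sketched as below. This is an illustration under assumptions, not the script's actual code: the helper names `check`, `check_packages`, and `check_services` are hypothetical, `rpm -q` and `systemctl is-active --quiet` are assumed to be the query commands, and the `[FAIL]` line format is inferred from the `[PASS]` lines in the test output.

```shell
# Hypothetical helpers; the real script may query packages and services differently.

# Usage: check <description> <command> [args...]
# Prints the "Checking:" line, runs the command quietly, then reports the result.
check() {
    local desc=$1
    shift
    echo "Checking: $desc"
    if "$@" >/dev/null 2>&1; then
        echo "[PASS] $desc"
        return 0
    fi
    echo "[FAIL] $desc"
    return 1
}

# Verify that each container runtime package is installed (assumes an RPM-based host).
check_packages() {
    local pkg rc=0
    for pkg in moby-engine moby-containerd nvidia-container-toolkit; do
        check "$pkg package is installed" rpm -q "$pkg" || rc=1
    done
    return "$rc"
}

# Verify that both services are active according to systemd.
check_services() {
    local rc=0
    check "containerd service is active" systemctl is-active --quiet containerd || rc=1
    check "Docker service is active" systemctl is-active --quiet docker || rc=1
    return "$rc"
}
```

Each function keeps going after a failure and returns non-zero at the end, so a single run reports every missing package or inactive service rather than stopping at the first one.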
Enhancement:
Add Nvidia Docker container GPU access validation.
Reason:
Add Nvidia Docker container GPU-related tests.
Result:
On an instance without a GPU (e.g. Standard D2ds v4):
`bash test-nvidia-docker.sh
[2026-03-03 01:54:03] ==========================================================
[2026-03-03 01:54:03] NVIDIA Container Runtime Test
[2026-03-03 01:54:03] ==========================================================
[2026-03-03 01:54:03] Test: Container runtime packages installation...
Checking: moby-engine package is installed
[PASS] moby-engine package is installed
Checking: moby-containerd package is installed
[PASS] moby-containerd package is installed
Checking: nvidia-container-toolkit package is installed
[PASS] nvidia-container-toolkit package is installed
[2026-03-03 01:54:03] ==========================================
[2026-03-03 01:54:03] Service Status Tests
[2026-03-03 01:54:03] ==========================================
[2026-03-03 01:54:03] Test: containerd service status...
Checking: containerd service is active
[PASS] containerd service is active
[2026-03-03 01:54:03] Test: Docker service status...
Checking: Docker service is active
[PASS] Docker service is active
[2026-03-03 01:54:03] ==========================================
[2026-03-03 01:54:03] GPU Access Test
[2026-03-03 01:54:03] ==========================================
[2026-03-03 01:54:03] Detecting GPU hardware...
Checking: GPU hardware presence
[INFO] nvidia-smi command failed - no GPU hardware detected
[2026-03-03 01:54:03] Test: GPU access in Docker container...
[SKIP] No GPU hardware detected - cannot test GPU access
`
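For the `[SKIP]` outcome above to be distinguishable from a failure, the caller has to interpret the script's exit status. The review feedback mentions exit code 77, which is the status Automake's test harness treats as "skipped"; whether the harness around this role follows that convention is an assumption. A hypothetical wrapper, `classify_result`, might look like this:

```shell
# Hypothetical wrapper: run a test command and map its exit status to a label.
# Usage: classify_result <command> [args...]
classify_result() {
    local rc=0
    "$@" || rc=$?
    case "$rc" in
        0)  echo "PASS" ;;
        77) echo "SKIP" ;;                          # Automake-style skip status
        *)  echo "FAIL (exit $rc)"; return 1 ;;     # anything else is a failure
    esac
}
```

For example, `classify_result bash test-nvidia-docker.sh` would print the script's own log followed by one of the three labels, and would return non-zero only for a real failure.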
On an instance with a GPU (e.g. Standard NC4as T4 v3), the GPU access test reports PASS:
`
[2026-03-03 02:28:57] ==========================================
[2026-03-03 02:28:57] GPU Access Test
[2026-03-03 02:28:57] ==========================================
[2026-03-03 02:28:57] Detecting GPU hardware...
Checking: GPU hardware presence
GPU detected
[2026-03-03 02:28:57] Test: GPU access in Docker container...
Checking: NVIDIA runtime is registered in Docker
[PASS] NVIDIA runtime is registered in Docker
Checking: GPU is accessible from Docker container
[PASS] GPU is accessible from Docker container
`
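The two GPU checks in this output could be implemented along the following lines. This is a sketch under assumptions rather than the script's actual commands: the function names are hypothetical, the runtime is assumed to be registered under the name `nvidia`, and `NVIDIA_IMAGE` must be set by the caller to a CUDA image that ships `nvidia-smi`.

```shell
# Hypothetical sketch of the two checks; the real script may use different commands.

# Succeeds when `docker info` lists a runtime named "nvidia".
nvidia_runtime_registered() {
    docker info --format '{{json .Runtimes}}' 2>/dev/null | grep -q '"nvidia"'
}

# Succeeds when nvidia-smi runs inside a container with all GPUs exposed.
# NVIDIA_IMAGE has no default here; the expansion aborts if it is unset.
gpu_accessible_in_container() {
    docker run --rm --gpus all \
        "${NVIDIA_IMAGE:?set NVIDIA_IMAGE to a CUDA image}" \
        nvidia-smi >/dev/null 2>&1
}
```

Running `nvidia-smi` inside the container is a common smoke test because it exits non-zero when the driver or device is not visible, so the exit status alone is enough to decide PASS or FAIL.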
Issue Tracker Tickets (Jira or BZ if any):
JIRA: RHELHPC-120
Summary by Sourcery
Add a scripted NVIDIA Docker GPU validation test to the Azure HPC role and install it on configured systems.