Automated test scripts for verifying Volcano GPU NUMA topology-aware scheduling.
PRs under test:
- volcano-sh/volcano#5095 — numaaware: add GPU NUMA topology awareness to scheduler
- volcano-sh/resource-exporter#12 — numatopo: add GPU NUMA topology discovery via sysfs
- volcano-sh/apis#229 — api: add GPUInfo type and GPUDetail field to NumatopoSpec
Issue: volcano-sh/volcano#4998
If you already have a Kubernetes cluster with GPU nodes, use this single script.

Prerequisites:
- Kubernetes cluster with GPU node (2+ NUMA nodes, 4+ GPUs)
- NVIDIA device plugin installed
- kubelet Topology Manager set to `best-effort` or `restricted`
- `kubectl` configured, `docker` and `go` 1.23+ installed
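You can sanity-check the tool prerequisites yourself before running anything. A minimal sketch (the `preflight` helper is hypothetical, not part of the repo; the real script's pre-flight phase also checks GPU nodes):

```shell
# Hypothetical pre-flight helper: verify that required CLI tools are on PATH.
# Pass the tools you need as arguments; returns nonzero if any are missing.
preflight() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      rc=1
    fi
  done
  return $rc
}

# Usage: preflight kubectl docker go || exit 1
```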
```bash
git clone https://github.com/pmady/gpu-numa-test.git   # or copy scripts
cd gpu-numa-test

# Full run: build → deploy → test
./test-existing-cluster.sh

# If Volcano images are already loaded:
./test-existing-cluster.sh --skip-build

# Run only the test suite:
./test-existing-cluster.sh --skip-build --skip-deploy

# Cleanup all test resources:
./test-existing-cluster.sh --cleanup
```

What the script does:

- Pre-flight checks — verifies kubectl, GPU nodes, build tools
- Topology probe — deploys a privileged pod to read GPU-to-NUMA mapping via sysfs
- Builds images — clones PR branches, builds `vc-scheduler` and `resource-exporter`
- Deploys Volcano — with the `numaaware` plugin enabled + the resource-exporter DaemonSet
- Runs test jobs — 2-GPU job (prefer single NUMA) + 4-GPU job (cross-NUMA)
- Checks scheduler logs — for NUMA scoring/hint entries
- Prints screenshot checklist — evidence to post on the PR
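The topology probe boils down to reading each GPU's NUMA affinity from sysfs, which is the same source resource-exporter PR #12 uses (in Go). A shell sketch of the idea — the `gpu_numa_map` helper is hypothetical, and the filter assumes NVIDIA devices (PCI vendor `0x10de`):

```shell
# Sketch: list NVIDIA GPUs with their NUMA node, as read from sysfs.
# Pass an alternate root (default /sys/bus/pci/devices) to run against a
# fake tree for testing.
gpu_numa_map() {
  root=${1:-/sys/bus/pci/devices}
  for dev in "$root"/*; do
    [ -f "$dev/vendor" ] || continue
    if [ "$(cat "$dev/vendor")" = "0x10de" ]; then   # NVIDIA vendor ID
      # numa_node is -1 when the platform reports no affinity
      echo "$(basename "$dev") $(cat "$dev/numa_node")"
    fi
  done
}
```

On a real node the output should agree with the NUMA affinity column of `nvidia-smi topo -m`.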
Manual verification commands:

```bash
nvidia-smi topo -m
kubectl get numatopologies -A -o yaml
kubectl get vcjob -o wide
kubectl logs <2gpu-pod>
kubectl logs <scheduler-pod> -n volcano-system | grep -i numa
```

If you don't have a GPU cluster, this creates one on GCP with spot pricing.

Prerequisites:
- GCP account with billing enabled ($300 free credit for new accounts)
- GPU quota: `NVIDIA_T4_GPUS` ≥ 4 in `us-central1`
- `gcloud` CLI installed and authenticated
```bash
# Default: 4x T4 in us-central1-a (spot pricing ~$2/hr)
./gpu-numa-test.sh

# Custom project/zone
./gpu-numa-test.sh --project my-project --zone us-east1-c

# Use A100 GPUs
./gpu-numa-test.sh --gpu-type nvidia-tesla-a100
```

| Phase | Duration | Description |
|---|---|---|
| Create VM | ~2 min | GCP spot VM: n1-standard-32 + 4× T4 |
| Install drivers | ~10 min | NVIDIA drivers + containerd + reboot |
| Setup K8s | ~5 min | kubeadm + topology manager + device plugin |
| Build Volcano | ~10 min | Build from PR branches, deploy |
| Run tests | ~5 min | 7 automated PASS/FAIL tests |
| Wait for you | — | Take screenshots, then type `go-ahead` |
| Cleanup | ~1 min | Deletes all GCP resources |
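The VM-create phase is an ordinary spot-instance request. A sketch of the command the orchestrator presumably builds (instance name and image family are assumptions; the accelerator and spot flags are standard `gcloud` options):

```shell
# Hypothetical helper: print the gcloud command for the spot GPU VM so it can
# be inspected (or eval'd). Machine shape matches the phase table above.
vm_create_cmd() {
  zone=$1
  gpu_type=$2
  echo gcloud compute instances create gpu-numa-test-vm \
    --zone="$zone" \
    --machine-type=n1-standard-32 \
    --accelerator=type="$gpu_type",count=4 \
    --provisioning-model=SPOT \
    --maintenance-policy=TERMINATE \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud
}

# Usage: vm_create_cmd us-central1-a nvidia-tesla-t4
```

GPU instances require `--maintenance-policy=TERMINATE` because they cannot live-migrate.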
Estimated cost: ~$2-4 (spot), $0 with free credit
| Command | Action |
|---|---|
| `go-ahead` | Delete VM and stop billing |
| `cost` | Show elapsed time and cost estimate |
| `ssh` | Print SSH command |
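Internally, the wait phase just blocks on stdin and dispatches these commands. A sketch under assumptions (the VM name, the flat ~$2/hr rate, and the function name are made up; the real script would run the `gcloud` deletion rather than print a message):

```shell
# Hypothetical interactive wait loop for the commands in the table above.
# Takes the run's start time (epoch seconds); returns on "go-ahead".
wait_for_goahead() {
  start=$1
  rate_cents_per_hr=200          # ~$2/hr spot estimate
  while read -r cmd; do
    case $cmd in
      go-ahead)
        echo "deleting VM and stopping billing..."
        # real script: gcloud compute instances delete gpu-numa-test-vm --quiet
        return 0 ;;
      cost)
        mins=$(( ($(date +%s) - start) / 60 ))
        cents=$(( mins * rate_cents_per_hr / 60 ))
        printf 'elapsed %dm, est. $%d.%02d\n' "$mins" $((cents / 100)) $((cents % 100)) ;;
      ssh)
        echo "gcloud compute ssh gpu-numa-test-vm --zone us-central1-a" ;;
      *)
        echo "commands: go-ahead | cost | ssh" ;;
    esac
  done
}
```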
```
gpu-numa-test/
├── test-existing-cluster.sh             # For existing GPU clusters (Option A)
├── gpu-numa-test.sh                     # GCP VM orchestrator (Option B)
├── scripts/
│   ├── vm-setup.sh                      # NVIDIA + K8s install (GCP VM)
│   ├── build-volcano.sh                 # Build & deploy from PR branches
│   └── run-tests.sh                     # Standalone test suite
├── manifests/
│   ├── test-gpu-numa-job.yaml           # 2-GPU test (single NUMA preferred)
│   ├── test-gpu-cross-numa-job.yaml     # 4-GPU test (cross-NUMA)
│   ├── volcano-scheduler-config.yaml    # Scheduler config with numaaware
│   └── resource-exporter-daemonset.yaml
└── README.md
```
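For orientation, the 2-GPU test manifest presumably resembles a standard Volcano Job requesting `nvidia.com/gpu: 2`. A sketch only — the names, image, command, and annotation value here are assumptions, not the PR's actual manifest:

```yaml
# Sketch of what manifests/test-gpu-numa-job.yaml likely contains.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: test-gpu-numa-2
spec:
  schedulerName: volcano
  minAvailable: 1
  tasks:
    - replicas: 1
      name: worker
      template:
        metadata:
          annotations:
            # Volcano's NUMA-aware annotation; value assumed for this test
            volcano.sh/numa-topology-policy: best-effort
        spec:
          restartPolicy: Never
          containers:
            - name: cuda
              image: nvidia/cuda:12.4.0-base-ubuntu22.04
              command: ["nvidia-smi", "topo", "-m"]
              resources:
                limits:
                  nvidia.com/gpu: 2   # the part that exercises NUMA placement
```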
After running, results are saved to `/tmp/volcano-gpu-numa-test/results/`:
| File | Contents |
|---|---|
| `topology-probe.txt` | GPU-to-NUMA mapping from sysfs |
| `numatopology-full.yaml` | Numatopology CRD with GPU data |
| `job-2gpu.txt` | 2-GPU job output |
| `job-4gpu.txt` | 4-GPU job output |
| `scheduler-numa-logs.txt` | Scheduler NUMA scoring entries |