KEDA GPU Scaler

Scale Kubernetes GPU workloads from real hardware metrics. No DCGM. No PromQL. Optional Prometheus metrics built in.

A KEDA External Scaler that reads NVIDIA GPU metrics directly through the NVML C bindings and autoscales your vLLM, Triton, and custom inference deployments, including scale-to-zero.

Why This Exists

By default, Kubernetes HPA watches CPU and memory, so it can't see GPU utilization. A vLLM pod can show 8% CPU while its GPU sits at 100%.

The usual fix is dcgm-exporter → Prometheus → KEDA, but that's 5 components and 15-30s of latency.

This project reads GPU metrics directly from NVML and serves them to KEDA over gRPC. 2 components, 2-4 second latency.

Why Not a Native KEDA Scaler?

Building GPU support into KEDA core is impractical for three reasons:

CGO Constraint: NVIDIA's Go bindings (go-nvml) require CGO_ENABLED=1. KEDA builds with CGO_ENABLED=0.
Node-Level Hardware Access: The KEDA operator runs as a central pod. NVML requires local GPU device access via libnvidia-ml.so, which only a DaemonSet on GPU nodes can provide.
Independent Release Cycle: Ship GPU scaling improvements without waiting for KEDA release cycles.

This design is documented in KEDA issue #7538.

Architecture

DaemonSet — Runs on nodes labeled with nvidia.com/gpu.present: "true".
NVML Bindings — Directly reads Streaming Multiprocessor (SM) utilization and Frame Buffer Memory via go-nvml C-bindings.
gRPC Interface — Implements externalscaler.ExternalScalerServer (IsActive, StreamIsActive, GetMetricSpec, GetMetrics) to natively integrate with the central KEDA operator.
ScaledObject Trigger — Kubernetes deployments scale up/down (including to zero) based on GPU thresholds defined in the ScaledObject.
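
As a rough illustration of the decision these components feed, the sketch below makes two assumptions that are not confirmed by this project's code: the scaler reports a workload as active only when its metric exceeds the activation threshold, and the target is evaluated as a Value-type HPA metric, for which Kubernetes computes ceil(currentReplicas × currentValue / targetValue). The function names are hypothetical, and the HPA tolerance band is ignored. In KEDA, the activation check governs the 0↔1 transition, while the HPA handles scaling between 1 and the maximum.

```python
import math


def is_active(metric_value: float, activation_threshold: float) -> bool:
    """Assumed activation rule: active only above the threshold."""
    return metric_value > activation_threshold


def desired_replicas(current_replicas: int, metric_value: float,
                     target_value: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Standard HPA ratio formula, clamped to the ScaledObject bounds."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_replicas * metric_value / target_value)
    return max(min_replicas, min(max_replicas, raw))
```

For example, four replicas at 100% GPU utilization against a target of 80 would ask for five replicas under these assumptions.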

GPU Metrics

Metric	Description	Unit
`gpu_utilization`	GPU compute (SM) utilization	% (0-100)
`memory_utilization`	GPU memory controller utilization	% (0-100)
`memory_used_mib`	GPU VRAM used	MiB
`memory_used_percent`	GPU VRAM used as percentage of total	% (0-100)
`temperature`	GPU die temperature	Celsius
`power_draw`	GPU power consumption	Watts
`pcie_tx_kbps`	PCIe transmit throughput, from the GPU's perspective (GPU→CPU)	KB/s
`pcie_rx_kbps`	PCIe receive throughput, from the GPU's perspective (CPU→GPU)	KB/s
`nvlink_tx_mbps`	NVLink transmit throughput (GPU→GPU)	MB/s
`nvlink_rx_mbps`	NVLink receive throughput (GPU→GPU)	MB/s
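
NVML reports memory in bytes and power in milliwatts, so some conversion is needed to reach the units in the table above. The sketch below shows one way that conversion could look. The `RawSample` type and `to_metrics` function are hypothetical names for illustration, not this project's API.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass
class RawSample:
    """Raw readings in NVML's native units (field names are illustrative)."""
    sm_util_percent: int
    memory_used_bytes: int
    memory_total_bytes: int
    power_milliwatts: int


def to_metrics(sample: RawSample) -> dict:
    """Convert one raw sample to the units used in the metrics table."""
    used_pct = 0.0
    if sample.memory_total_bytes > 0:
        used_pct = 100.0 * sample.memory_used_bytes / sample.memory_total_bytes
    return {
        "gpu_utilization": float(sample.sm_util_percent),
        "memory_used_mib": sample.memory_used_bytes / MIB,
        "memory_used_percent": used_pct,
        "power_draw": sample.power_milliwatts / 1000.0,
    }
```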

Pre-built Scaling Profiles

Instead of configuring raw metric thresholds, use a profile optimized for your workload:

Profile	Primary Metric	Target	Activation	Use Case
`vllm-inference`	Memory %	80	5	vLLM / LLM serving with scale-to-zero
`triton-inference`	GPU Util	75	10	NVIDIA Triton Inference Server
`training`	GPU Util	90	0	Training jobs (no scale-to-zero)
`batch`	Memory %	70	1	Batch inference with aggressive scale-down
`distributed-training`	NVLink TX	800	100	Data-parallel training on NVLink systems
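
Conceptually, a profile is a named set of defaults that explicit trigger metadata can override. The sketch below models the table that way. It assumes that "Memory %" maps to `memory_used_percent`, "GPU Util" to `gpu_utilization`, and "NVLink TX" to `nvlink_tx_mbps`; that mapping and the `resolve_trigger` helper are illustrative guesses, not the scaler's actual implementation.

```python
# Profile defaults transcribed from the table; metric names are assumed.
PROFILES = {
    "vllm-inference": {"metricType": "memory_used_percent", "targetValue": 80, "activationThreshold": 5},
    "triton-inference": {"metricType": "gpu_utilization", "targetValue": 75, "activationThreshold": 10},
    "training": {"metricType": "gpu_utilization", "targetValue": 90, "activationThreshold": 0},
    "batch": {"metricType": "memory_used_percent", "targetValue": 70, "activationThreshold": 1},
    "distributed-training": {"metricType": "nvlink_tx_mbps", "targetValue": 800, "activationThreshold": 100},
}


def resolve_trigger(metadata: dict) -> dict:
    """Start from the profile's defaults, then apply explicit overrides."""
    name = metadata.get("profile")
    if name is not None and name not in PROFILES:
        raise ValueError(f"unknown profile: {name}")
    resolved = dict(PROFILES.get(name, {}))
    # Trigger metadata values arrive as strings, so numbers are parsed here.
    for key in ("targetValue", "activationThreshold"):
        if key in metadata:
            resolved[key] = float(metadata[key])
    if "metricType" in metadata:
        resolved["metricType"] = metadata["metricType"]
    return resolved
```

With this model, `{"profile": "vllm-inference", "targetValue": "70"}` keeps the profile's metric and activation threshold but lowers the target to 70.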

Prerequisites

A Kubernetes cluster (e.g., OKE, GKE, EKS, AKS) with NVIDIA GPU worker nodes
KEDA v2.10+ installed in the cluster
NVIDIA GPU drivers and Device Plugin installed

Quick Start

1. Deploy the Scaler

Deploy the DaemonSet and gRPC service into your cluster. (Ensure KEDA is already installed.)

kubectl apply -f deploy/manifests.yaml

This deploys a DaemonSet that runs on every GPU node in your cluster, plus a ClusterIP Service for KEDA to discover it.

Or use Helm:

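# --set-string keeps the label value a string; nodeSelector values must be strings, not booleans
helm install keda-gpu-scaler deploy/helm/keda-gpu-scaler \
  --namespace keda \
  --set-string nodeSelector."nvidia\.com/gpu\.present"=true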

2. Attach to your AI Workload

Create a ScaledObject pointing to the external scaler service:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference-scaler
  namespace: ai-workloads
spec:
  scaleTargetRef:
    name: vllm-deepseek-deployment
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: external
      metadata:
        scalerAddress: "keda-gpu-scaler.keda.svc.cluster.local:6000"
        targetGpuUtilization: "80"

Or use a pre-built profile:

triggers:
  - type: external
    metadata:
      scalerAddress: "keda-gpu-scaler.keda.svc.cluster.local:6000"
      profile: "vllm-inference"

3. Custom Configuration

Override any profile default or use raw GPU metrics directly:

triggers:
  - type: external
    metadata:
      scalerAddress: "keda-gpu-scaler.keda.svc.cluster.local:6000"
      metricType: "gpu_utilization"
      targetValue: "85"
      activationThreshold: "10"
      gpuIndex: "0"              # specific GPU index, or omit for all
      aggregation: "max"         # max, min, avg, sum across GPUs

See deploy/examples/ for ready-to-use ScaledObject manifests.

Configuration Reference

Parameter	Description	Default
`profile`	Pre-built scaling profile name	(none)
`metricType`	GPU metric to scale on	`gpu_utilization`
`targetValue`	Target metric value for scaling	`80`
`targetGpuUtilization`	Shorthand for GPU utilization target	(none)
`targetMemoryUtilization`	Shorthand for VRAM utilization target	(none)
`activationThreshold`	Value below which scale-to-zero activates	`0`
`gpuIndex`	Specific GPU index to monitor	`-1` (all GPUs)
`aggregation`	Multi-GPU aggregation: `max`, `min`, `avg`, `sum`	`max`
`pollIntervalSeconds`	Metric polling interval	`10`
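
To make the interaction between `gpuIndex` and `aggregation` concrete, here is a minimal sketch of how per-GPU readings on one node could be reduced to the single value reported to KEDA. It assumes that a non-negative `gpuIndex` bypasses aggregation entirely. The `aggregate` function is a hypothetical helper, not the scaler's code.

```python
from statistics import mean

AGGREGATORS = {"max": max, "min": min, "avg": mean, "sum": sum}


def aggregate(per_gpu: dict, aggregation: str = "max", gpu_index: int = -1) -> float:
    """Reduce {gpu index: metric value} to one number.

    gpu_index=-1 means all GPUs (the documented default).
    """
    if gpu_index >= 0:
        if gpu_index not in per_gpu:
            raise KeyError(f"no GPU with index {gpu_index}")
        return float(per_gpu[gpu_index])
    if not per_gpu:
        raise ValueError("no GPU readings available")
    if aggregation not in AGGREGATORS:
        raise ValueError(f"unsupported aggregation: {aggregation}")
    return float(AGGREGATORS[aggregation](list(per_gpu.values())))
```

The default of `max` means one saturated GPU is enough to trigger scaling, while `avg` smooths over idle devices on the same node.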

Prometheus Metrics (Optional)

The scaler exposes an optional Prometheus-compatible /metrics endpoint for monitoring the scaler itself and GPU fleet health. This is independent of the KEDA scaling path — scaling works identically with or without it.

Enable/Disable

# Enabled by default on port 9090
--metrics-port=9090

# Disable entirely (zero overhead)
--metrics-port=0

Helm:

metrics:
  enabled: true   # set to false to disable
  port: 9090

Exposed Metrics

Metric	Type	Description
`keda_gpu_scaler_gpu_utilization_percent`	Gauge	GPU compute utilization (per GPU)
`keda_gpu_scaler_gpu_memory_used_bytes`	Gauge	GPU memory in use (per GPU)
`keda_gpu_scaler_gpu_memory_total_bytes`	Gauge	Total GPU memory (per GPU)
`keda_gpu_scaler_gpu_temperature_celsius`	Gauge	GPU temperature (per GPU)
`keda_gpu_scaler_gpu_power_draw_watts`	Gauge	GPU power draw (per GPU)
`keda_gpu_scaler_collections_total`	Counter	Total NVML collection calls
`keda_gpu_scaler_collection_errors_total`	Counter	Failed NVML collection calls
`keda_gpu_scaler_collection_duration_seconds`	Histogram	NVML collection latency
`keda_gpu_scaler_scaler_requests_total`	Counter	gRPC requests by method
`keda_gpu_scaler_scaler_request_errors_total`	Counter	gRPC errors by method

All per-GPU metrics are labeled with gpu_index, gpu_uuid, and gpu_name.
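
One use for the scaler's own counters is watching the NVML collection error rate. The sketch below parses Prometheus text exposition output and computes that ratio. It assumes the two collection counters are exposed without labels and that sample lines carry no trailing timestamps. The helper names are hypothetical.

```python
def parse_samples(text: str) -> dict:
    """Parse exposition-format lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The value is the last space-separated token; label values such as
        # gpu_name may themselves contain spaces.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def collection_error_ratio(samples: dict) -> float:
    """Fraction of NVML collection calls that failed (0.0 if none ran)."""
    total = samples.get("keda_gpu_scaler_collections_total", 0.0)
    errors = samples.get("keda_gpu_scaler_collection_errors_total", 0.0)
    return errors / total if total else 0.0
```

Feed `parse_samples` the body returned by the /metrics endpoint and alert when the ratio stays above whatever threshold suits your fleet.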

Kubernetes Probes

The scaler exposes liveness and readiness endpoints on a dedicated probe port:

/healthz returns 200 while the process is alive.
/readyz returns 200 after NVML initializes and the first metrics collection succeeds.

--probe-port=8081

Helm:

probes:
  enabled: true
  port: 8081
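
The difference between the two endpoints can be modeled as a small state machine. This is a sketch of the behavior described above, not the scaler's code. The `ProbeState` class is hypothetical, and returning 503 while not ready is an assumption.

```python
class ProbeState:
    """Tracks the two conditions that gate readiness."""

    def __init__(self) -> None:
        self.nvml_initialized = False
        self.first_collection_ok = False

    def healthz(self) -> int:
        # Liveness: answering at all means the process is alive.
        return 200

    def readyz(self) -> int:
        # Readiness: NVML is up and one collection has succeeded.
        ready = self.nvml_initialized and self.first_collection_ok
        return 200 if ready else 503
```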

Build it Yourself

This project requires CGO_ENABLED=1 to compile the NVIDIA C-bindings.

Note: The compiled binaries (keda-gpu-scaler and gpu-metrics) dynamically link NVIDIA's NVML library and load libnvidia-ml.so at runtime. They will fail to start on any machine that does not have the NVIDIA driver installed (the driver provides libnvidia-ml.so), such as a laptop or CI runner with no NVIDIA GPU. You can still build, lint, and run the test suite without a GPU, since the tests use a mock collector (see "Can I run this without a GPU?").

# Build KEDA scaler binary (requires CGO for NVML)
make build

# Build standalone GPU metrics CLI (no KEDA/gRPC needed)
make build-metrics

# Build all binaries
make build-all

# Run unit tests
make test

# Run linter
make lint

# Generate protobuf Go code
make proto

# Build and push a release image
make docker-release VERSION=v0.1.0

# Deploy to cluster
make deploy

Standalone GPU Metrics CLI

Collect GPU metrics without Kubernetes — works on bare metal, SLURM jobs, Flux jobs, Kubernetes pods, and Singularity containers. The same binary and the same JSON schema work everywhere.

Important: gpu-metrics requires libnvidia-ml.so (installed with the NVIDIA driver) on the host. On a machine without an NVIDIA driver it exits immediately with "nvml init failed".

gpu-metrics                       # one-shot table output (env auto-detected)
gpu-metrics --format json         # JSON for scripting
gpu-metrics --format csv          # CSV for analysis
gpu-metrics --interval 5s         # continuous collection
gpu-metrics --device 0 --quiet    # single GPU, no logs
gpu-metrics --env slurm           # force environment (auto|k8s|slurm|flux|standalone)

The --env flag auto-detects the orchestrator by default. Detection priority: SLURM → Flux → Kubernetes → standalone.

Every environment emits the same unified JSON schema with an environment block so you can compare GPU performance across on-prem and cloud with identical tooling:

{
  "environment": { "orchestrator": "slurm", "node": "compute-01", "job_id": "123", "task_rank": 0 },
  "collected_at": "2026-06-17T10:00:00Z",
  "devices": [...]
}
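
Because the top-level shape is the same everywhere, downstream tooling can key on the environment block without caring where a report came from. The sketch below uses only the top-level fields shown above (`environment`, `collected_at`, `devices`). The `summarize` helper is hypothetical, and the contents of each device entry are deliberately left untouched because they are not shown here.

```python
import json


def summarize(report_json: str) -> dict:
    """Reduce one gpu-metrics JSON report to a comparable summary row."""
    report = json.loads(report_json)
    env = report["environment"]
    return {
        "source": f'{env["orchestrator"]}/{env["node"]}',
        "collected_at": report["collected_at"],
        "device_count": len(report["devices"]),
    }
```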

SLURM — auto-detected when SLURM_JOB_ID is set; collects only the GPUs assigned to your job step:

srun --gres=gpu:2 gpu-metrics --format json

Flux — auto-detected when FLUX_JOB_ID is set; collects only the GPUs in CUDA_VISIBLE_DEVICES:

flux run -N1 -g2 gpu-metrics --format json
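
The detection priority and the environment variables mentioned above can be sketched as follows. SLURM_JOB_ID and FLUX_JOB_ID come from the text. Using KUBERNETES_SERVICE_HOST (which Kubernetes sets inside pods) as the Kubernetes signal is an assumption, as is parsing CUDA_VISIBLE_DEVICES as a list of integer indices, since it can also hold GPU UUIDs. Both function names are hypothetical.

```python
from typing import List, Mapping, Optional


def detect_orchestrator(env: Mapping[str, str]) -> str:
    """Apply the priority SLURM > Flux > Kubernetes > standalone."""
    if env.get("SLURM_JOB_ID"):
        return "slurm"
    if env.get("FLUX_JOB_ID"):
        return "flux"
    if env.get("KUBERNETES_SERVICE_HOST"):  # assumed Kubernetes signal
        return "k8s"
    return "standalone"


def visible_devices(env: Mapping[str, str]) -> Optional[List[int]]:
    """Return the integer indices in CUDA_VISIBLE_DEVICES, or None if unset."""
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [int(part) for part in raw.split(",") if part.strip()]
```

Passing `os.environ` to both functions reproduces the auto-detection path; passing a plain dict makes the logic easy to unit test without a scheduler.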

See HPC & Cross-Environment Metrics for full usage, and Cross-Environment Comparison Guide for comparing on-prem vs cloud GPU runs.

To build and push the scaler's Docker image directly, without make docker-release:

docker build -t your-registry/keda-gpu-scaler:v0.1.0 .
docker push your-registry/keda-gpu-scaler:v0.1.0

How It Compares

	keda-gpu-scaler	dcgm-exporter + Prometheus	Custom Metrics API
Components	1 DaemonSet (+ optional /metrics)	dcgm-exporter + Prometheus + adapter	Custom metrics server
Metric latency	2-4 s end to end (sub-second NVML read)	15-30s (scrape interval)	Depends on implementation
Scale-to-zero	Yes (KEDA native)	Yes (with KEDA Prometheus scaler)	Manual
Configuration	3-line ScaledObject	PromQL query per metric	Custom code
GPU metrics	10 hardware metrics	50+ DCGM metrics	Whatever you build
Dependencies	KEDA, NVIDIA drivers	KEDA, Prometheus, dcgm-exporter	Varies
Failure domain	Node-local	Centralized Prometheus	Varies

Documentation

Design Document — Architecture decisions, gRPC interface, scaling profiles, testing strategy
Migration Guide — Replace dcgm-exporter + Prometheus with keda-gpu-scaler
HPC & Cross-Environment Metrics — SLURM, Flux, Kubernetes, and standalone GPU metrics
Cross-Environment Comparison — Compare GPU performance across on-prem and cloud
FAQ — Common questions about GPU scaling, MIG, multi-GPU, scale-to-zero
Changelog — Release history

Adopters

Using keda-gpu-scaler? Add your organization to ADOPTERS.md.

Roadmap

AMD ROCm support
MIG per-instance metrics
vLLM queue depth scaling

Contributing

Contributions welcome — GPU autoscaling use cases, vendor support (AMD ROCm, Intel), or docs improvements. See CONTRIBUTING.md.

License

Apache License 2.0. See LICENSE for details.
