What's Changed
This release introduces Helm-based deployment, PCIe root topology alignment, machine CPU grouping mode, ARM64 node support, and many other changes to improve reliability and observability.
Highlights
- Helm Chart Deployment: This replaces the install.yaml method, providing better configurability, helm-linting, and schema validation.
- PCIe Root Topology Alignment: The driver can now scan node PCI buses (opt-in via --expose-pcie-roots flag) to discover and expose standard resource.kubernetes.io/pcieRoot device attributes. This enables Kubernetes schedulers to align CPU allocations with specific PCIe roots (e.g. ensuring high-performance workloads share the same PCIe root as attached GPUs or network interfaces). Requires the alpha DRAListTypeAttributes feature gate to be enabled in the cluster.
- ARM64 Node Support: Enabled multi-arch image builds (amd64 and arm64). CPU topology discovery has been refactored to strictly use sysfs, and fixes have been implemented for L3 cache discovery, SMT detection, and NUMA affinity masking on ARM64 hardware.
- Machine Grouping & Opaque Parameter: Introduced a new --group-by=machine configuration. In this mode, the driver exposes a single node-wide capacity device and enforces exact CPU assignments provided via the claim's opaque configuration parameters. This is being evaluated as a replacement path for the individual mode when using external schedulers to enforce precise CPU allocation.
- Enhanced Reliability and Atomicity:
  - NRI Restart Recovery: Restores container CPU pinning if the container runtime restarts.
  - Atomic Setup: Hardened resource allocation to write CDI configuration files before claiming devices, preventing pods from starting with missing CPU configurations.
  - Safe Teardown: Hardened resource release to clean up CDI configuration before freeing devices, avoiding race conditions if a device is immediately re-allocated.
  - Idempotency: Made the allocation setup idempotent to gracefully handle Kubelet retries.
  - Standardized CDI: Adopted the upstream CDI Cache API for standard configuration file management.
- Observability & Security:
  - Probes: Added standard /healthz liveness and readiness probes.
  - Metrics: Added /metrics Prometheus endpoint for scraping driver stats.
- Debugging Helper (dracpu-gatherinfo): Added a built-in helper tool to scan and print the host node's hardware topology in the driver's internal format, simplifying debugging of node-side topology detection.
- Enhanced E2E Test Coverage: Added new test suites to verify NRI restart reconciliation, /healthz endpoints, contextual logging, the dracpu-gatherinfo debugger, PCIe root attributes, and machine grouping mode. Also deflaked existing tests to improve CI reliability.
- Structured Contextual Logging: Refactored logging to use structured contextual logging (go-logr), aligning with Kubernetes logging standards.
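For readers unfamiliar with the sysfs-based topology discovery mentioned under ARM64 Node Support: the kernel publishes per-CPU topology in files such as /sys/devices/system/cpu/cpu0/topology/thread_siblings_list, using a compact CPU list format like 0-3,8-11. The sketch below shows one way to expand that format and derive an SMT check from it. It is an illustration only, not the driver's code (the driver is not written in Python); the function names parse_cpu_list and smt_enabled are hypothetical, and the rarely seen stride syntax of the list format is not handled.

```python
def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list string such as "0-3,8-11" into sorted CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string yields no CPUs
        if "-" in part:
            # Inclusive range, e.g. "8-11" -> 8, 9, 10, 11
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def smt_enabled(thread_siblings: str) -> bool:
    """Treat a core as SMT-enabled when it lists more than one hardware thread.

    `thread_siblings` is the content of a thread_siblings_list file.
    """
    return len(parse_cpu_list(thread_siblings)) > 1
```

On a real node the input would come from reading the sysfs file for each CPU; keeping the parser a pure function of the string makes it easy to test against captured topology data from different architectures.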
Installation
install.yaml was the recommended installation method for version 0.1.0 (the previous release). This version (0.2.0) still ships install.yaml for backward compatibility to ensure a smooth transition, but Helm is now the preferred installation method.
We plan to transition fully to Helm in the next version, at which point install.yaml will no longer be shipped.
helm show chart oci://registry.k8s.io/dra-driver-cpu/charts/dra-driver-cpu --version 0.2.0
Requirements
- Kubernetes version 1.36.0+ (or 1.34.0+ with the DRAConsumableCapacity alpha feature gate enabled).
- Helm v3 (for Helm-based deployment).
- Kubelet Static CPU Policy must be disabled on the nodes.
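The Kubernetes version rule above can be written as a small pre-flight check. This is a minimal sketch that only encodes the requirement as stated in this list; the function names parse_version and cluster_supported are hypothetical, and how the version string and feature gate map are obtained from a cluster is left out.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse "v1.34.2" or "1.36.0" into (major, minor, patch).

    Pre-release and build suffixes ("-rc.1", "+build") are dropped.
    """
    core = version.lstrip("v").split("-", 1)[0].split("+", 1)[0]
    major, minor, patch = (int(x) for x in core.split("."))
    return major, minor, patch


def cluster_supported(version: str, feature_gates: dict[str, bool]) -> bool:
    """Return True if the cluster satisfies the stated version requirement."""
    v = parse_version(version)
    if v >= (1, 36, 0):
        return True
    # Versions from 1.34.0 up to (not including) 1.36.0 need the alpha gate enabled.
    return v >= (1, 34, 0) and feature_gates.get("DRAConsumableCapacity", False)
```

For example, cluster_supported("v1.35.2", {"DRAConsumableCapacity": True}) passes, while the same version without the gate does not. The check does not cover the other two requirements (Helm v3 and a disabled kubelet static CPU policy).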
All Changes
Full Changelog: v0.1.0...v0.2.0
A huge thank you to everyone who contributed to this release:
@AutuSnow, @back1ash, @ffromani, @fmuyassarov, @gauravgahlot, @Karthik-K-N, @pravk03, @rocker-zhang
New Contributors
- @back1ash made their first contribution in #105
- @Karthik-K-N made their first contribution in #121
- @gauravgahlot made their first contribution in #117
- @rocker-zhang made their first contribution in #166