v0.2.0

@pravk03 pravk03 released this 22 Jun 22:07
7f95568

What's Changed

This release introduces Helm-based deployment, PCIe root topology alignment, machine CPU grouping mode, ARM64 node support, and many other changes to improve reliability and observability.

Highlights

  • Helm Chart Deployment: Helm supersedes install.yaml as the preferred installation method, providing better configurability, Helm linting, and schema validation.
  • PCIe Root Topology Alignment: The driver can now scan node PCI buses (opt-in via --expose-pcie-roots flag) to discover and expose standard resource.kubernetes.io/pcieRoot device attributes. This enables Kubernetes schedulers to align CPU allocations with specific PCIe roots (e.g. ensuring high-performance workloads share the same PCIe root as attached GPUs or network interfaces). Requires the alpha DRAListTypeAttributes feature gate to be enabled in the cluster.
  • ARM64 Node Support: Enabled multi-arch image builds (amd64 and arm64). CPU topology discovery has been refactored to rely strictly on sysfs, and fixes have been implemented for L3 cache discovery, SMT detection, and NUMA affinity masking on ARM64 hardware.
  • Machine Grouping & Opaque Parameter: Introduced a new --group-by=machine configuration. In this mode, the driver exposes a single node-wide capacity device and enforces exact CPU assignments provided via the claim's opaque configuration parameters. This is being evaluated as a replacement path for the individual mode when using external schedulers to enforce precise CPU allocation.
  • Enhanced Reliability and Atomicity:
    • NRI Restart Recovery: Restores container CPU pinning if the container runtime restarts.
    • Atomic Setup: Hardened resource allocation to write CDI configuration files before claiming devices, preventing pods from starting with missing CPU configurations.
    • Safe Teardown: Hardened resource release to clean up CDI configuration before freeing devices, avoiding race conditions if a device is immediately re-allocated.
    • Idempotency: Made the allocation setup idempotent to gracefully handle Kubelet retries.
    • Standardized CDI: Adopted the upstream CDI Cache API for standard configuration file management.
  • Observability & Security:
    • Probes: Added standard /healthz liveness and readiness probes.
    • Metrics: Added a /metrics Prometheus endpoint for scraping driver stats.
  • Debugging Helper (dracpu-gatherinfo): Added a built-in helper tool to scan and print the host node's hardware topology in the driver's internal format, simplifying debugging of node-side topology detection.
  • Enhanced E2E Test Coverage: Added new test suites to verify NRI restart reconciliation, /healthz endpoints, contextual logging, the dracpu-gatherinfo debugger, PCIe root attributes, and machine grouping mode. Also deflaked existing tests to improve CI reliability.
  • Structured Contextual Logging: Refactored logging to use structured contextual logging (go-logr), aligning with Kubernetes logging standards.
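
The ordering described under Enhanced Reliability and Atomicity can be sketched as follows. This is a minimal in-memory illustration, not the driver's code: the node type and the prepare and unprepare functions are hypothetical names, and the maps stand in for CDI spec files and device ownership. It assumes setup writes the CDI spec before claiming CPUs, teardown removes the spec before freeing CPUs, and a repeated prepare for the same claim repeats the same writes without error.

```go
package main

import "fmt"

// node is a hypothetical stand-in for driver state; these names are
// illustrative and do not come from the driver.
type node struct {
	cdiSpecs map[string][]int // claim ID -> CPUs recorded in its CDI spec
	claimed  map[int]string   // CPU ID -> claim ID that owns it
}

func newNode() *node {
	return &node{cdiSpecs: map[string][]int{}, claimed: map[int]string{}}
}

// prepare writes the CDI spec first and only then claims the CPUs.
// Repeating the call for the same claim rewrites the same state, so a
// retry is harmless.
func (n *node) prepare(claim string, cpus []int) error {
	for _, c := range cpus {
		if owner, ok := n.claimed[c]; ok && owner != claim {
			return fmt.Errorf("cpu %d is already claimed by %q", c, owner)
		}
	}
	n.cdiSpecs[claim] = append([]int(nil), cpus...) // step 1: CDI spec
	for _, c := range cpus {                        // step 2: claim CPUs
		n.claimed[c] = claim
	}
	return nil
}

// unprepare removes the CDI spec first and only then frees the CPUs, so a
// CPU is never available for re-allocation while a stale spec still names it.
func (n *node) unprepare(claim string) {
	delete(n.cdiSpecs, claim) // step 1: remove CDI spec
	for c, owner := range n.claimed {
		if owner == claim {
			delete(n.claimed, c) // step 2: free CPUs
		}
	}
}

func main() {
	n := newNode()
	fmt.Println(n.prepare("claim-a", []int{0, 1})) // first attempt
	fmt.Println(n.prepare("claim-a", []int{0, 1})) // retry of the same claim
	fmt.Println(n.prepare("claim-b", []int{1, 2})) // conflicts on CPU 1
	n.unprepare("claim-a")
	fmt.Println(n.prepare("claim-b", []int{1, 2})) // succeeds after teardown
}
```

In this sketch a rejected prepare leaves no CDI spec behind, because the conflict check runs before any state is written.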

Installation

install.yaml was the recommended installation method for the previous release (0.1.0). This release (0.2.0) still ships install.yaml for backward compatibility and a smooth transition, but Helm is now the preferred installation method.
We plan to complete the transition to Helm in the next release, which will no longer ship install.yaml.

helm show chart oci://registry.k8s.io/dra-driver-cpu/charts/dra-driver-cpu --version 0.2.0

Requirements

  • Kubernetes version 1.36.0+ (or 1.34.0+ with the DRAConsumableCapacity alpha feature gate enabled).
  • Helm v3 (for Helm-based deployment).
  • Kubelet Static CPU Policy must be disabled on the nodes.
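
The Kubernetes version rule above can be expressed as a small check. This is a sketch only: meetsRequirement is a hypothetical helper, not part of the driver or of any Kubernetes library, and it assumes the caller already knows the cluster's major and minor version and whether the DRAConsumableCapacity gate is enabled.

```go
package main

import "fmt"

// meetsRequirement is a hypothetical helper that encodes the stated rule:
// 1.36+ qualifies outright, and 1.34 or 1.35 qualifies only when the
// DRAConsumableCapacity feature gate is enabled.
func meetsRequirement(major, minor int, consumableCapacityGate bool) bool {
	if major != 1 {
		// Assumption: any later major version qualifies, any earlier one does not.
		return major > 1
	}
	if minor >= 36 {
		return true
	}
	return minor >= 34 && consumableCapacityGate
}

func main() {
	fmt.Println(meetsRequirement(1, 36, false))
	fmt.Println(meetsRequirement(1, 34, true))
	fmt.Println(meetsRequirement(1, 34, false))
}
```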

All Changes

Full Changelog: v0.1.0...v0.2.0

A huge thank you to everyone who contributed to this release:
@AutuSnow, @back1ash, @ffromani, @fmuyassarov, @gauravgahlot, @Karthik-K-N, @pravk03, @rocker-zhang
