
OCPNODE-4538: Add e2e tests for DRA Partitionable Devices (KEP-4815)#31230

Draft
sabujmaity wants to merge 1 commit into
openshift:mainfrom
sabujmaity:feat/OCPNODE-4538-dra-partitionable-devices-e2e

Conversation

@sabujmaity
Contributor

@sabujmaity sabujmaity commented May 28, 2026

Summary

Adds downstream e2e tests for the DRAPartitionableDevices feature (KEP-4815)
using the upstream dra-example-driver with kubeletPlugin.gpuPartitions enabled.
Tests:

  • should publish ResourceSlices with SharedCounters and ConsumesCounters
  • should allocate partition device to pod via DRA
  • should mark pod unschedulable when all counters are exhausted on a node

Architecture:

  • Reuses existing dra-example-driver install from OCPNODE-4108
  • Helm upgrade enables partitioning (numDevices=2, gpuPartitions=4)
  • Tests auto-skip when DRAPartitionableDevices feature gate is disabled
  • AfterAll restores driver to default config

Gating: [OCPFeatureGate:DRAPartitionableDevices] - Prow auto-skips on clusters without the gate enabled.

JIRA

https://issues.redhat.com/browse/OCPNODE-4538

Summary by CodeRabbit

  • Tests

    • Added comprehensive end-to-end test suite for the DRA PartitionableDevices feature, including validation of shared counters, device allocation to multiple pods, and counter-exhaustion scheduling scenarios.
  • Chores

    • Introduced counter validation helpers and improved prerequisite installer logic for DRA testing infrastructure.

Add three e2e tests validating the DRAPartitionableDevices feature gate
using the upstream dra-example-driver with gpuPartitions enabled:
1. Validates ResourceSlice two-slice model (SharedCounters + ConsumesCounters)
2. Validates partition device allocation to pod via DRA ResourceClaim
3. Validates counter exhaustion renders additional claims unschedulable
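
The exhaustion scenario can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the types below are simplified stand-ins, not the real resource.k8s.io API types, and the counter name "partitions", the "gpu-N" set names, and the one-unit-per-partition consumption are hypothetical. With numDevices=2 and gpuPartitions=4 the model grants eight single-partition requests and rejects the ninth.

```go
package main

import (
	"errors"
	"fmt"
)

// counterSet is a simplified stand-in for a shared counter set published in a
// ResourceSlice; it is not the real API type.
type counterSet struct {
	name      string
	remaining map[string]int64
}

// partition is a simplified stand-in for a device that consumes counters.
type partition struct {
	name     string
	set      string           // counter set the device draws from
	consumes map[string]int64 // amount consumed per counter
}

var errExhausted = errors.New("insufficient counters")

// allocate deducts the partition's consumption from its counter set. It
// returns errExhausted, changing nothing, if any counter would go negative.
func allocate(sets map[string]*counterSet, p partition) error {
	cs, ok := sets[p.set]
	if !ok {
		return fmt.Errorf("unknown counter set %q", p.set)
	}
	for k, v := range p.consumes {
		if cs.remaining[k] < v {
			return errExhausted
		}
	}
	for k, v := range p.consumes {
		cs.remaining[k] -= v
	}
	return nil
}

// newNode builds numDevices counter sets, each with a hypothetical
// "partitions" counter initialised to gpuPartitions.
func newNode(numDevices int, gpuPartitions int64) map[string]*counterSet {
	sets := make(map[string]*counterSet, numDevices)
	for i := 0; i < numDevices; i++ {
		name := fmt.Sprintf("gpu-%d", i)
		sets[name] = &counterSet{name: name, remaining: map[string]int64{"partitions": gpuPartitions}}
	}
	return sets
}

// allocateAny tries each GPU in order and takes one partition from the first
// one that still has capacity.
func allocateAny(sets map[string]*counterSet, order []string) error {
	for _, name := range order {
		p := partition{name: name + "-partition", set: name, consumes: map[string]int64{"partitions": 1}}
		if err := allocate(sets, p); err == nil {
			return nil
		}
	}
	return errExhausted
}

func main() {
	sets := newNode(2, 4)
	order := []string{"gpu-0", "gpu-1"}
	granted := 0
	for i := 0; i < 9; i++ {
		if err := allocateAny(sets, order); err != nil {
			fmt.Printf("request %d: %v\n", i+1, err)
			continue
		}
		granted++
	}
	fmt.Println("granted:", granted)
}
```

In the real test the scheduler, not test code, performs this bookkeeping; the model only shows why one more claim than the node's total partition count is expected to stay pending.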
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 28, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2026
@openshift-ci-robot

openshift-ci-robot commented May 28, 2026

@sabujmaity: This pull request references OCPNODE-4538 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.


@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sabujmaity
Once this PR has been reviewed and has the lgtm label, please assign bertinatto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented May 28, 2026

Walkthrough

This PR adds a complete E2E test suite for the DRA PartitionableDevices feature (KEP-4815). It includes a ResourceSlice validation helper, enhancements to driver installation infrastructure supporting Helm upgrades and namespace cleanup, and three test scenarios validating shared counters, device allocation, and capacity exhaustion.

Changes

PartitionableDevices Feature Testing

| Layer / File(s) | Summary |
|---|---|
| Counter Validator Helper (test/extended/node/dra/common/counter_validator.go) | Introduces CounterValidator for ResourceSlice validation. Splits slices by SharedCounters vs Devices, validates counter presence with non-zero values, verifies device counter consumption, counts partition-named devices, and selects schedulable nodes while filtering tainted ones. |
| Prerequisites Installer Enhancements (test/extended/node/dra/example/prerequisites_installer.go) | Extends installer with pre-cleanup via ensureNamespaceGone (polling until fully removed), git availability verification, refactored Helm args via commonHelmArgs, new exported HelmUpgrade method with chart path resolution, and helm-first rollback with terminating-state awareness in IsDriverInstalled and RollbackMutations. |
| Partitionable Test Suite and Scenarios (test/extended/node/dra/partitionable/partitionable_dra.go) | Implements E2E test suite with Ginkgo contexts that configure driver partitioning (2 GPUs, 4 partitions per GPU), validates SharedCounters and ConsumesCounters via counter validator, tests Pod allocation with partition device requests and name verification, and validates capacity exhaustion with Unschedulable condition messaging. |
| Test Module Registration (test/extended/include.go) | Adds blank import for test/extended/node/dra/partitionable to register the test suite for discovery. |
| OWNERS Configuration (test/extended/node/dra/partitionable/OWNERS) | Defines approvers, reviewers, and labels (sig/scheduling, area/dra) for the partitionable test directory. |

Sequence Diagram

sequenceDiagram
  participant TestSuite as Test Suite
  participant PrerequisitesInstaller
  participant DriverConfig as Driver<br/>(Helm)
  participant CounterValidator
  participant DeviceClass
  participant Pod
  TestSuite->>PrerequisitesInstaller: InstallAll with cleanup
  TestSuite->>DriverConfig: HelmUpgrade (partition mode)
  TestSuite->>CounterValidator: ValidateSharedCounters
  CounterValidator->>DeviceClass: List ResourceSlices
  TestSuite->>DeviceClass: Create with requests
  TestSuite->>Pod: Create with DeviceClaim
  Pod->>Pod: Allocate partition devices
  TestSuite->>Pod: Validate "partition" in names
  TestSuite->>CounterValidator: Verify consumption
  TestSuite->>Pod: Create exhaustion pod (pending)
  Pod->>Pod: Unschedulable (insufficient)
  TestSuite->>DriverConfig: HelmUpgrade (restore non-partition)

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error, 2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| No-Sensitive-Data-In-Logs | ❌ Error | Logging exposes command output and Helm values that could contain sensitive data: helm/git outputs in errors/warnings (lines 95,102,127,279,309,155,435,465) and setValues logged directly (line 305). | Remove or sanitize helm/git command output logging; avoid logging Helm setValues without filtering sensitive values. |
| Test Structure And Quality | ⚠️ Warning | Seven ExpectNoError assertions lack error messages. Exhaustion test uses GetNodeWithDevices which may return tainted nodes, risking scheduling failures from taints rather than counter exhaustion. | Add messages to 7 bare ExpectNoError calls. Check node taints after GetNodeWithDevices; skip test if NoSchedule or NoExecute taints exist. |
| Single Node Openshift (Sno) Test Compatibility | ⚠️ Warning | Test 3 lacks SNO compatibility protection. It uses GetNodeWithDevices() which may return tainted nodes on SNO, causing pod scheduling failures unrelated to counter exhaustion being tested. | Add [Skipped:SingleReplicaTopology] label to test name, or add exutil.IsSingleNode() check with g.Skip() in BeforeEach, or validate selected node is untainted before using it in Test 3. |
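
For the No-Sensitive-Data-In-Logs finding, one possible way to filter Helm set values before logging is sketched below. It is an illustration, not the PR's code: the function names, the list of sensitive key fragments, and the "registry.password" example key are all hypothetical, and a real fix might prefer an allow-list or simply not log the values at all.

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveHints is a hypothetical list of key fragments treated as secret.
var sensitiveHints = []string{"password", "token", "secret", "key"}

// isSensitive reports whether the key contains any fragment in sensitiveHints,
// compared case-insensitively.
func isSensitive(key string) bool {
	lower := strings.ToLower(key)
	for _, hint := range sensitiveHints {
		if strings.Contains(lower, hint) {
			return true
		}
	}
	return false
}

// redactSetValues returns a copy of Helm-style "key=value" pairs with the
// value masked whenever the key looks sensitive. Entries without "=" are
// passed through unchanged.
func redactSetValues(setValues []string) []string {
	out := make([]string, 0, len(setValues))
	for _, kv := range setValues {
		key, _, found := strings.Cut(kv, "=")
		if found && isSensitive(key) {
			out = append(out, key+"=<redacted>")
			continue
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	// Illustrative values only; "registry.password" is not a known chart key.
	vals := []string{"gpuPartitions=4", "registry.password=hunter2"}
	fmt.Println(redactSetValues(vals))
}
```

The log call would then receive the redacted copy instead of the raw slice, leaving the arguments passed to Helm untouched.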
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title clearly and specifically identifies the main change: adding end-to-end tests for the DRA Partitionable Devices feature with appropriate JIRA reference and KEP identifier. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test declarations (Describe, Context, It, When) in partitionable_dra.go use static, deterministic strings with no dynamic values like node names, pod names, namespaces, timestamps, or UUIDs. |
| Microshift Test Compatibility | ✅ Passed | Test has proper MicroShift protection via BeforeEach (lines 62-67) using exutil.IsMicroShiftCluster() with g.Skip(). Uses only standard K8s APIs (core/v1, resource/v1). |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR adds only e2e test code under test/extended/, not production deployment manifests, operator code, or controllers. The topology-aware scheduling check does not apply to test infrastructure. |
| Ote Binary Stdout Contract | ✅ Passed | All logging uses framework.Logf(), not direct stdout writes; no fmt.Print/log.Print found; test suite properly declared within g.Describe() closure with no module-level side effects. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Test suite contains [Skipped:Disconnected] in Ginkgo Describe (line 50) and has no IPv4 assumptions. External git clone requirement is properly handled by the skip marker. |
| No-Weak-Crypto | ✅ Passed | No weak cryptography, custom crypto implementations, or non-constant-time secret comparisons found in any of the PR's 5 modified files; the code focuses entirely on DRA partitionable device testing. |
| Container-Privileges | ✅ Passed | Test code has no privileged container specs. Pods use restrictive SecurityContext (AllowPrivilegeEscalation=false, drop ALL). SCC grant is pre-existing infrastructure requirement. |
Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
test/extended/node/dra/example/prerequisites_installer.go (1)

164-172: 💤 Low value

Consider logging unexpected API errors in the poll loop.

When getErr is non-nil but not NotFound, the current code logs "still exists, waiting for GC" which is misleading if the actual error is a network or auth failure. While this resilience pattern is reasonable for cleanup, logging the actual error would aid debugging.

♻️ Suggested improvement
 return wait.PollUntilContextTimeout(ctx, 3*time.Second, 3*time.Minute, true, func(ctx context.Context) (bool, error) {
     _, getErr := pi.client.CoreV1().Namespaces().Get(ctx, driverNamespace, metav1.GetOptions{})
     if errors.IsNotFound(getErr) {
         framework.Logf("Namespace %s fully removed", driverNamespace)
         return true, nil
     }
+    if getErr != nil {
+        framework.Logf("Error checking namespace %s (will retry): %v", driverNamespace, getErr)
+        return false, nil
+    }
     framework.Logf("Namespace %s still exists, waiting for GC...", driverNamespace)
     return false, nil
 })
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/dra/example/prerequisites_installer.go` around lines 164 -
172, The poll callback in wait.PollUntilContextTimeout that calls
pi.client.CoreV1().Namespaces().Get currently treats any non-NotFound error as
"still exists" which is misleading; modify the anonymous func used by
wait.PollUntilContextTimeout (the closure referencing driverNamespace and
getErr) to check if getErr != nil and !errors.IsNotFound(getErr) and, in that
branch, log the actual getErr (e.g., using framework.Logf or the existing
logger) with context before returning false,nil so retries continue—ensure you
reference the same getErr, driverNamespace, and the poll closure so only the
logging behavior changes.
test/extended/node/dra/common/counter_validator.go (1)

33-54: 💤 Low value

Docstring claims "no Devices" constraint not enforced by code.

The docstring states counter slices have "SharedCounters, no Devices", but the implementation only checks for presence of SharedCounters. A slice with both would appear in both lists. While conforming drivers use the two-slice model, the code doesn't enforce the documented invariant.

Consider either updating the docstring to reflect actual behavior (categorizes by presence of each field) or adding the exclusion check if strict separation is intended.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/dra/common/counter_validator.go` around lines 33 - 54, The
docstring for GetResourceSlicesByType promises "SharedCounters, no Devices" but
the implementation only checks SharedCounters and allows slices with both fields
to be listed in both outputs; update the logic in GetResourceSlicesByType so
counterSlices only includes slices where slice.Spec.SharedCounters is non-empty
AND slice.Spec.Devices is empty (i.e., use the condition on
slice.Spec.SharedCounters and slice.Spec.Devices), keep deviceSlices as slices
with slice.Spec.Devices non-empty, and update the function docstring to match
the enforced invariant; reference: GetResourceSlicesByType,
slice.Spec.SharedCounters, slice.Spec.Devices.
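
The stricter categorisation described in this comment might look roughly like the sketch below. It uses a simplified local `slice` type with hypothetical field names, not the real ResourceSlice type, and it is a variation on the suggestion: slices that carry both fields are reported in a separate `mixed` list instead of being counted as device slices, so a non-conforming driver is easy to spot.

```go
package main

import "fmt"

// slice is a simplified stand-in for a ResourceSlice spec; the real type
// lives in the resource.k8s.io API group and has more fields.
type slice struct {
	name           string
	sharedCounters []string // names of counter sets, if any
	devices        []string // names of devices, if any
}

// splitSlices separates counter-only slices, device-only slices, and slices
// that carry both fields. Slices with neither field are ignored.
func splitSlices(all []slice) (counterSlices, deviceSlices, mixed []slice) {
	for _, s := range all {
		hasCounters := len(s.sharedCounters) > 0
		hasDevices := len(s.devices) > 0
		switch {
		case hasCounters && hasDevices:
			mixed = append(mixed, s)
		case hasCounters:
			counterSlices = append(counterSlices, s)
		case hasDevices:
			deviceSlices = append(deviceSlices, s)
		}
	}
	return counterSlices, deviceSlices, mixed
}

func main() {
	counters, devices, mixed := splitSlices([]slice{
		{name: "counters", sharedCounters: []string{"gpu-0"}},
		{name: "devices", devices: []string{"gpu-0-partition-0"}},
		{name: "both", sharedCounters: []string{"gpu-1"}, devices: []string{"gpu-1-partition-0"}},
	})
	fmt.Println(len(counters), len(devices), len(mixed))
}
```

A test could then assert that `mixed` is empty when the two-slice model is expected.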
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/dra/partitionable/partitionable_dra.go`:
- Around line 232-287: GetNodeWithDevices() can return a tainted fallback node
which causes the pinned pod to fail scheduling for taint reasons; after calling
counterValidator.GetNodeWithDevices(ctx) (and getting nodeName) fetch the Node
object via oc.KubeFramework().ClientSet.CoreV1().Nodes().Get(...) and inspect
node.Spec.Taints, and if any non-tolerable taints exist, iterate available
device-capable nodes (use counterValidator or list nodes with device resource
slices) to pick an untainted nodeName, updating exhaustPod.Spec.NodeSelector
accordingly; if no untainted node is available, fail the test with a clear
message.
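
The node-selection step requested above could be sketched as follows. The `node` and `taint` types are minimal local stand-ins that mirror only the name, taint key, and taint effect of a Kubernetes node, and `pickUntaintedNode` is a hypothetical helper name. The sketch treats NoSchedule and NoExecute as blocking and ignores tolerations, which matches a test pod that declares none.

```go
package main

import "fmt"

// taint mirrors only the Key and Effect fields of a Kubernetes node taint.
type taint struct {
	key    string
	effect string
}

// node is a minimal stand-in for a device-capable node.
type node struct {
	name   string
	taints []taint
}

// blocksScheduling reports whether a taint keeps a pod without a matching
// toleration off the node. PreferNoSchedule is a soft preference, so it is
// not treated as blocking.
func blocksScheduling(t taint) bool {
	return t.effect == "NoSchedule" || t.effect == "NoExecute"
}

// pickUntaintedNode returns the first candidate that has no blocking taint.
// The boolean is false when every candidate is tainted.
func pickUntaintedNode(candidates []node) (string, bool) {
	for _, n := range candidates {
		blocked := false
		for _, t := range n.taints {
			if blocksScheduling(t) {
				blocked = true
				break
			}
		}
		if !blocked {
			return n.name, true
		}
	}
	return "", false
}

func main() {
	candidates := []node{
		{name: "master-0", taints: []taint{{key: "node-role.kubernetes.io/master", effect: "NoSchedule"}}},
		{name: "worker-0"},
	}
	name, ok := pickUntaintedNode(candidates)
	fmt.Println(name, ok)
}
```

When the helper returns false, the test can fail or skip with a message that names taints as the cause, so the result is not mistaken for counter exhaustion.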


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 46dab00d-6005-4d5a-9fbc-233ea24214ae

📥 Commits

Reviewing files that changed from the base of the PR and between a29f970 and 999ccf0.

📒 Files selected for processing (5)
  • test/extended/include.go
  • test/extended/node/dra/common/counter_validator.go
  • test/extended/node/dra/example/prerequisites_installer.go
  • test/extended/node/dra/partitionable/OWNERS
  • test/extended/node/dra/partitionable/partitionable_dra.go


Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
