OCPNODE-4538: Add e2e tests for DRA Partitionable Devices (KEP-4815)#31230
OCPNODE-4538: Add e2e tests for DRA Partitionable Devices (KEP-4815)#31230sabujmaity wants to merge 1 commit into
Conversation
Add three e2e tests validating the DRAPartitionableDevices feature gate using the upstream dra-example-driver with gpuPartitions enabled: 1. Validates ResourceSlice two-slice model (SharedCounters + ConsumesCounters) 2. Validates partition device allocation to pod via DRA ResourceClaim 3. Validates counter exhaustion renders additional claims unschedulable
|
@sabujmaity: This pull request references OCPNODE-4538 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sabujmaity The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
WalkthroughThis PR adds a complete E2E test suite for the DRA PartitionableDevices feature (KEP-4815). It includes a ResourceSlice validation helper, enhancements to driver installation infrastructure supporting Helm upgrades and namespace cleanup, and three test scenarios validating shared counters, device allocation, and capacity exhaustion. ChangesPartitionableDevices Feature Testing
Sequence DiagramsequenceDiagram
participant TestSuite as Test Suite
participant PrerequisitesInstaller
participant DriverConfig as Driver<br/>(Helm)
participant CounterValidator
participant DeviceClass
participant Pod
TestSuite->>PrerequisitesInstaller: InstallAll with cleanup
TestSuite->>DriverConfig: HelmUpgrade (partition mode)
TestSuite->>CounterValidator: ValidateSharedCounters
CounterValidator->>DeviceClass: List ResourceSlices
TestSuite->>DeviceClass: Create with requests
TestSuite->>Pod: Create with DeviceClaim
Pod->>Pod: Allocate partition devices
TestSuite->>Pod: Validate "partition" in names
TestSuite->>CounterValidator: Verify consumption
TestSuite->>Pod: Create exhaustion pod (pending)
Pod->>Pod: Unschedulable (insufficient)
TestSuite->>DriverConfig: HelmUpgrade (restore non-partition)
🎯 3 (Moderate) | ⏱️ ~25 minutes Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 2 warnings)
✅ Passed checks (12 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
test/extended/node/dra/example/prerequisites_installer.go (1)
164-172: 💤 Low valueConsider logging unexpected API errors in the poll loop.
When
getErris non-nil but notNotFound, the current code logs "still exists, waiting for GC" which is misleading if the actual error is a network or auth failure. While this resilience pattern is reasonable for cleanup, logging the actual error would aid debugging.♻️ Suggested improvement
return wait.PollUntilContextTimeout(ctx, 3*time.Second, 3*time.Minute, true, func(ctx context.Context) (bool, error) { _, getErr := pi.client.CoreV1().Namespaces().Get(ctx, driverNamespace, metav1.GetOptions{}) if errors.IsNotFound(getErr) { framework.Logf("Namespace %s fully removed", driverNamespace) return true, nil } + if getErr != nil { + framework.Logf("Error checking namespace %s (will retry): %v", driverNamespace, getErr) + return false, nil + } framework.Logf("Namespace %s still exists, waiting for GC...", driverNamespace) return false, nil })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/node/dra/example/prerequisites_installer.go` around lines 164 - 172, The poll callback in wait.PollUntilContextTimeout that calls pi.client.CoreV1().Namespaces().Get currently treats any non-NotFound error as "still exists" which is misleading; modify the anonymous func used by wait.PollUntilContextTimeout (the closure referencing driverNamespace and getErr) to check if getErr != nil and !errors.IsNotFound(getErr) and, in that branch, log the actual getErr (e.g., using framework.Logf or the existing logger) with context before returning false,nil so retries continue—ensure you reference the same getErr, driverNamespace, and the poll closure so only the logging behavior changes.test/extended/node/dra/common/counter_validator.go (1)
33-54: 💤 Low valueDocstring claims "no Devices" constraint not enforced by code.
The docstring states counter slices have "SharedCounters, no Devices", but the implementation only checks for presence of SharedCounters. A slice with both would appear in both lists. While conforming drivers use the two-slice model, the code doesn't enforce the documented invariant.
Consider either updating the docstring to reflect actual behavior (categorizes by presence of each field) or adding the exclusion check if strict separation is intended.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/node/dra/common/counter_validator.go` around lines 33 - 54, The docstring for GetResourceSlicesByType promises "SharedCounters, no Devices" but the implementation only checks SharedCounters and allows slices with both fields to be listed in both outputs; update the logic in GetResourceSlicesByType so counterSlices only includes slices where slice.Spec.SharedCounters is non-empty AND slice.Spec.Devices is empty (i.e., use the condition on slice.Spec.SharedCounters and slice.Spec.Devices), keep deviceSlices as slices with slice.Spec.Devices non-empty, and update the function docstring to match the enforced invariant; reference: GetResourceSlicesByType, slice.Spec.SharedCounters, slice.Spec.Devices.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/node/dra/partitionable/partitionable_dra.go`:
- Around line 232-287: GetNodeWithDevices() can return a tainted fallback node
which causes the pinned pod to fail scheduling for taint reasons; after calling
counterValidator.GetNodeWithDevices(ctx) (and getting nodeName) fetch the Node
object via oc.KubeFramework().ClientSet.CoreV1().Nodes().Get(...) and inspect
node.Spec.Taints, and if any non-tolerable taints exist, iterate available
device-capable nodes (use counterValidator or list nodes with device resource
slices) to pick an untainted nodeName, updating exhaustPod.Spec.NodeSelector
accordingly; if no untainted node is available, fail the test with a clear
message.
---
Nitpick comments:
In `@test/extended/node/dra/common/counter_validator.go`:
- Around line 33-54: The docstring for GetResourceSlicesByType promises
"SharedCounters, no Devices" but the implementation only checks SharedCounters
and allows slices with both fields to be listed in both outputs; update the
logic in GetResourceSlicesByType so counterSlices only includes slices where
slice.Spec.SharedCounters is non-empty AND slice.Spec.Devices is empty (i.e.,
use the condition on slice.Spec.SharedCounters and slice.Spec.Devices), keep
deviceSlices as slices with slice.Spec.Devices non-empty, and update the
function docstring to match the enforced invariant; reference:
GetResourceSlicesByType, slice.Spec.SharedCounters, slice.Spec.Devices.
In `@test/extended/node/dra/example/prerequisites_installer.go`:
- Around line 164-172: The poll callback in wait.PollUntilContextTimeout that
calls pi.client.CoreV1().Namespaces().Get currently treats any non-NotFound
error as "still exists" which is misleading; modify the anonymous func used by
wait.PollUntilContextTimeout (the closure referencing driverNamespace and
getErr) to check if getErr != nil and !errors.IsNotFound(getErr) and, in that
branch, log the actual getErr (e.g., using framework.Logf or the existing
logger) with context before returning false,nil so retries continue—ensure you
reference the same getErr, driverNamespace, and the poll closure so only the
logging behavior changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 46dab00d-6005-4d5a-9fbc-233ea24214ae
📒 Files selected for processing (5)
test/extended/include.gotest/extended/node/dra/common/counter_validator.gotest/extended/node/dra/example/prerequisites_installer.gotest/extended/node/dra/partitionable/OWNERStest/extended/node/dra/partitionable/partitionable_dra.go
Summary
Adds downstream e2e tests for the DRAPartitionableDevices feature (KEP-4815)
using the upstream dra-example-driver with
kubeletPlugin.gpuPartitionsenabled.Tests:
should publish ResourceSlices with SharedCounters and ConsumesCountersshould allocate partition device to pod via DRAshould mark pod unschedulable when all counters are exhausted on a nodeArchitecture:
Gating:
[OCPFeatureGate:DRAPartitionableDevices]- Prow auto-skips on clusterswithout the gate enabled.
JIRA
https://issues.redhat.com/browse/OCPNODE-4538
Summary by CodeRabbit
Tests
Chores