[NVIDIA-859] Precompiled signed drivers support #79757

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from josecastillolema:precompiled-driver
May 28, 2026

Conversation

@josecastillolema
Member

@josecastillolema josecastillolema commented May 27, 2026

The versions here are intentionally hardcoded because the image tags available in registry.stage.redhat.io use either:

  • rhcos4.20 (OpenShift version-based)
  • rhel9_6 (with underscore, not dot)

When this gets fixed, we will move this test to that registry instead. cc @enriquebelarte

Signed-off-by: Jose Castillo Lema <josecastillolema@gmail.com>
@openshift-ci openshift-ci Bot requested review from empovit and wabouhamad May 27, 2026 13:13
@coderabbitai
Contributor

coderabbitai Bot commented May 27, 2026

Walkthrough

This PR updates a single CI configuration file to introduce a new signed-driver test job. It adds a release selector for signed-driver with narrowed version bounds, renames an existing test job to reference this release, and configures the job with GPU driver environment variables including precompiled artifact settings and subscription channel specifications.

Changes

Signed-driver test job setup

Layer / File(s) Summary
Signed-driver release and job configuration
ci-operator/config/rh-ecosystem-edge/nvidia-ci/rh-ecosystem-edge-nvidia-ci-main__4.21-stable.yaml
A new signed-driver release selector is added with version bounds 4.21.14-0 through 4.21.15-0 under releases.arm64-latest.prerelease. The test job's "as" identifier is renamed to nvidia-gpu-operator-signed-driver, and the job is configured with OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE set to release:signed-driver, a NVIDIAGPU_GPU_CLUSTER_POLICY_PATCH JSON configuration specifying precompiled GPU driver artifacts (repository and version 580.159.03), and NVIDIAGPU_SUBSCRIPTION_CHANNEL set to v26.3.
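
For illustration, the patch value described above could be assembled as follows. This is a minimal sketch, not the content of the actual config file: the repository, image, and version come from this PR's discussion, while the use of JSON Patch "add" operations, the /spec/driver/usePrecompiled path, and the helper name build_cluster_policy_patch are assumptions.

```python
import json

# Values reported in this PR's discussion.
DRIVER_REPOSITORY = "quay.io/jcastillolema"
DRIVER_IMAGE = "gpu-driver-rhel9"
DRIVER_VERSION = "580.159.03"


def build_cluster_policy_patch(repository, image, version, use_precompiled=True):
    """Return a JSON Patch style document as a compact string.

    Hypothetical helper: the operation type ("add") and the usePrecompiled
    path are assumptions, not taken from the real CI configuration.
    """
    operations = [
        {"op": "add", "path": "/spec/driver/repository", "value": repository},
        {"op": "add", "path": "/spec/driver/image", "value": image},
        {"op": "add", "path": "/spec/driver/version", "value": version},
        # Assumed location of the ClusterPolicy usePrecompiled field.
        {"op": "add", "path": "/spec/driver/usePrecompiled", "value": use_precompiled},
    ]
    return json.dumps(operations, separators=(",", ":"))


patch = build_cluster_policy_patch(DRIVER_REPOSITORY, DRIVER_IMAGE, DRIVER_VERSION)
print(patch)
```

The resulting single-line string is the kind of value that would be placed in the NVIDIAGPU_GPU_CLUSTER_POLICY_PATCH environment variable; generating it with json.dumps avoids hand-escaping quotes inside YAML.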

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'NVIDIA-859 Precompiled signed drivers support' directly relates to the main change, which adds configuration for precompiled signed drivers in the CI/CD pipeline.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR modifies only CI configuration YAML files. No Ginkgo test definitions exist in this repository. The check is not applicable to non-test code.
Test Structure And Quality ✅ Passed Custom check for Ginkgo test code quality is not applicable—PR modifies only YAML CI configuration files, not test code.
Microshift Test Compatibility ✅ Passed This PR only modifies CI job configuration YAML files and does not add any new Ginkgo e2e tests (It(), Describe(), Context(), When()). The custom check is not applicable to this change.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR modifies only CI operator configuration (YAML), not adding any new Ginkgo e2e tests, so SNO test compatibility check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only CI job configuration, not deployment manifests/operator code/controllers. No scheduling constraints are introduced.
Ote Binary Stdout Contract ✅ Passed PR only modifies YAML configuration files, not Go source code. OTE Binary Stdout Contract check applies exclusively to Go code with stdout writes in process-level code; therefore not applicable here.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds only CI configuration (YAML), not new Ginkgo e2e test code. Custom check for IPv6/disconnected network compatibility applies only to new test code additions.
No-Weak-Crypto ✅ Passed PR contains only YAML CI configuration changes with no cryptographic code, weak algorithms, custom crypto, or insecure comparisons.
Container-Privileges ✅ Passed File contains no privileged container settings: no privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN, allowPrivilegeEscalation, or root without justification found.
No-Sensitive-Data-In-Logs ✅ Passed No passwords, tokens, API keys, PII, or credentials in the PR. Only standard CI configuration with container references and internal domain names.

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2026
@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@josecastillolema: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-signed-driver rh-ecosystem-edge/nvidia-ci presubmit Presubmit changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-images rh-ecosystem-edge/nvidia-ci presubmit Ci-operator config changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-e2e-25-10-x rh-ecosystem-edge/nvidia-ci presubmit Ci-operator config changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-e2e-26-3-x rh-ecosystem-edge/nvidia-ci presubmit Ci-operator config changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-e2e-arm64 rh-ecosystem-edge/nvidia-ci presubmit Ci-operator config changed
pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-e2e-master rh-ecosystem-edge/nvidia-ci presubmit Ci-operator config changed
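
The presubmit names in the table appear to be composed from the organization, repository, branch, config variant, and test identifier. The sketch below reproduces the first name under that assumption; the function presubmit_job_name is hypothetical and is not part of any Prow or ci-operator tooling.

```python
def presubmit_job_name(org, repo, branch, variant, test_name):
    """Compose a presubmit job name from its parts.

    Assumed pattern: pull-ci-<org>-<repo>-<branch>[-<variant>]-<test>.
    """
    parts = ["pull-ci", org, repo, branch]
    if variant:
        # The variant corresponds to the suffix after "__" in the config filename.
        parts.append(variant)
    parts.append(test_name)
    return "-".join(parts)


name = presubmit_job_name(
    org="rh-ecosystem-edge",
    repo="nvidia-ci",
    branch="main",
    variant="4.21-stable",
    test_name="nvidia-gpu-operator-signed-driver",
)
print(name)
```

The composed string is what gets passed to /pj-rehearse {test-name} when a single rehearsal is requested.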

Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/config/rh-ecosystem-edge/nvidia-ci/rh-ecosystem-edge-nvidia-ci-main__4.21-stable.yaml`:
- Around line 163-167: The NVIDIAGPU_GPU_CLUSTER_POLICY_PATCH currently points
to a non-public Quay image tag
(quay.io/jcastillolema/gpu-driver-rhel9:580.159.03) which CI cannot discover;
update the patch value used by NVIDIAGPU_GPU_CLUSTER_POLICY_PATCH to reference a
publicly accessible/official driver repository and tag (or replace with a known
public mirror) or coordinate to publish the specified tag so it is pullable by
CI; modify the string value inside NVIDIAGPU_GPU_CLUSTER_POLICY_PATCH in the
YAML (the JSON patch entries for /spec/driver/repository, /spec/driver/image,
/spec/driver/version) to the public repository, image and tag that are reachable
by CI.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 33c15f19-f485-4b40-95df-d6ae606c8d17

📥 Commits

Reviewing files that changed from the base of the PR and between 766fea0 and 503bd7b.

⛔ Files ignored due to path filters (1)
  • ci-operator/jobs/rh-ecosystem-edge/nvidia-ci/rh-ecosystem-edge-nvidia-ci-main-presubmits.yaml is excluded by !ci-operator/jobs/**
📒 Files selected for processing (1)
  • ci-operator/config/rh-ecosystem-edge/nvidia-ci/rh-ecosystem-edge-nvidia-ci-main__4.21-stable.yaml

@josecastillolema
Member Author

/pj-rehearse pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-signed-driver

@openshift-merge-bot
Contributor

@josecastillolema: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@josecastillolema
Member Author

/pj-rehearse pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.21-stable-nvidia-gpu-operator-signed-driver

@openshift-merge-bot
Contributor

@josecastillolema: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci
Contributor

openshift-ci Bot commented May 27, 2026

@josecastillolema: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@josecastillolema
Member Author

josecastillolema commented May 28, 2026

Looks like it worked fine:

  1. Driver container image (gpu_operand_pod_images.txt):

    nvidia-driver-daemonset-5.14.0-570.112.1.el9.6-rhel9.6-hwc9k:
      quay.io/jcastillolema/gpu-driver-rhel9:580.159.03-5.14.0-570.112.1.el9_6.x86_64-rhel9.6

    The image tag includes the kernel version (5.14.0-570.112.1.el9_6.x86_64), which is the precompiled image format. Standard (non-precompiled) driver images use a plain version tag like 580.159.03 and compile on-node.

  2. ClusterPolicy (cluster_policy.yaml): confirms usePrecompiled: true with repository: quay.io/jcastillolema.

  3. Driver pod log: the nvidia-driver-ctr container goes straight from _load_driver to modprobe nvidia without any compilation step. There is no dnf install, no make, no kernel-devel download, and no nvidia-installer run: it just loads pre-built kernel modules. The entire driver init took seconds.

  4. Timing: ClusterPolicy went from created (18:10:10) to ready (18:14:26) in ~4 minutes. On-node compilation typically takes longer on a g4dn.xlarge.
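
The tag and timing checks above can be sketched in a few lines. This is an illustrative sketch only: the helper names parse_precompiled_tag and elapsed_seconds are hypothetical, and the tag layout (driver version, then kernel version, then OS tag, separated by hyphens) is inferred from the single image tag quoted in this comment.

```python
from datetime import datetime


def parse_precompiled_tag(tag):
    """Split a precompiled driver tag into its parts.

    Assumed layout: <driver version>-<kernel version>-<os tag>.
    The kernel version itself contains a hyphen, so everything between
    the first and last field is rejoined.
    """
    parts = tag.split("-")
    if len(parts) < 3:
        # A plain tag such as "580.159.03" carries no kernel version.
        raise ValueError(f"not a precompiled tag: {tag!r}")
    return {
        "driver_version": parts[0],
        "kernel_version": "-".join(parts[1:-1]),
        "os_tag": parts[-1],
    }


def elapsed_seconds(start, end, fmt="%H:%M:%S"):
    """Return whole seconds between two same-day wall-clock timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


info = parse_precompiled_tag("580.159.03-5.14.0-570.112.1.el9_6.x86_64-rhel9.6")
print(info)
print(elapsed_seconds("18:10:10", "18:14:26"))  # 256 seconds, i.e. about 4 minutes
```

Note that the kernel string uses el9_6 while the trailing OS tag uses rhel9.6, so a check like this has to treat the two fields separately rather than normalizing dots and underscores.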

@josecastillolema
Member Author

cc @ShiraEzra

@empovit
Contributor

empovit commented May 28, 2026

Nice!

/lgtm
/approve

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 28, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: empovit, josecastillolema

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@josecastillolema
Member Author

/pj-rehearse ack

@openshift-merge-bot
Contributor

@josecastillolema: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label May 28, 2026
@openshift-merge-bot openshift-merge-bot Bot merged commit 1d16974 into openshift:main May 28, 2026
17 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. rehearsals-ack Signifies that rehearsal jobs have been acknowledged

2 participants