
Conversation


Copilot AI commented Nov 16, 2025

Description

closes NVIDIA#1024
closes NVIDIA#985

Refactors Windows GPU driver installation infrastructure to support architecture-specific drivers and driver mode switching, and expands test coverage across Python/CUDA combinations with driver mode verification.

Changes

Driver installer refactoring (ci/tools/install_gpu_driver.ps1):

  • Moved from .github/workflows/ to ci/tools/ for consistency with other CI tooling
  • Based on CCCL: the script follows the NVIDIA/cccl installer with minimal modifications so the two copies stay in sync
  • GPU type detection via GPU_TYPE environment variable passed directly from test matrix
  • Data center GPUs (L4, A100, T4, V100, etc.) → Tesla driver; Desktop GPUs (RTX series) → GeForce/Quadro driver
  • Driver mode configuration via nvidia-smi -fdm 2 (MCDM mode switching only; see the sketch after this list)
  • Uses driver version 581.15 (from CCCL)
  • Minimal changes from CCCL:
    • Added rtxpro6000 to desktop GPU list
    • Enhanced driver mode section to support DRIVER_MODE environment variable for MCDM switching
    • Uses GPU_TYPE env var directly instead of parsing runner label (deviation from CCCL for better integration with cuda-python workflows)
    • Data center GPUs: TCC (default) → MCDM when DRIVER_MODE=MCDM
    • Desktop GPUs: WDDM (default) → MCDM when DRIVER_MODE=MCDM
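
A minimal sketch of the selection and mode-switching logic described above, for illustration only (the variable names, the shortened desktop-GPU list, and the installer filenames are placeholders rather than copies of the actual script):

# Illustrative sketch, not the actual ci/tools/install_gpu_driver.ps1.
# GPU_TYPE and DRIVER_MODE are provided by the workflow from the test matrix.
$gpu        = $env:GPU_TYPE      # e.g. "l4", "a100", "rtx4090"
$driverMode = $env:DRIVER_MODE   # "TCC", "WDDM", or "MCDM"
$version    = "581.15"

# Desktop-GPU list shortened here for illustration.
$desktopGpus = @("rtx2080", "rtx4090")

if ($desktopGpus -contains $gpu) {
    # Desktop GPUs: GeForce/Quadro driver, WDDM by default (exact URL omitted).
    $driverUrl = "<GeForce/Quadro installer URL for $version>"
} else {
    # Data center GPUs (L4, A100, T4, V100, ...): Tesla driver, TCC by default.
    $driverUrl = "https://us.download.nvidia.com/tesla/$version/..."   # filename omitted
}

# ... download and silently install $driverUrl here ...

# Switch the driver model to MCDM only when the matrix asks for it.
if ($driverMode -eq "MCDM") {
    nvidia-smi -fdm 2    # 2 selects MCDM; the real script also resets the display device
}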

Test matrix expansion (ci/test-matrix.json):

  • Windows: 8 → 12 configurations (6 Python versions × 2 CUDA versions)
    • Python: 3.10, 3.11, 3.12, 3.13, 3.14, 3.14t
    • CUDA: 12.9.1, 13.0.2
    • GPUs: rtx2080, rtxpro6000, rtx4090, l4, a100, v100
    • New DRIVER_MODE field: TCC/MCDM for data center GPUs, WDDM/MCDM for desktop GPUs
    • Replaced v100 with rtxpro6000 for CUDA 13.0.2 configurations due to NVRTC compatibility constraints
  • Emptied linux.nightly and windows.nightly arrays per requirements

Workflow integration (.github/workflows/test-wheel-windows.yml):

  • Passes DRIVER_MODE and GPU_TYPE to installer as environment variables
  • Matrix automatically provides GPU-appropriate mode for each test configuration
  • Driver mode verification: added a post-installation step that runs nvidia-smi -q, reads the "Driver Model" output, and validates that it matches the DRIVER_MODE expected by the matrix configuration (see the sketch after this list)
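
Roughly, the verification could look like the sketch below; the regex used to parse the nvidia-smi -q output is an assumption, since the exact layout of the "Driver Model" section varies across driver versions.

# Rough sketch of the verification step; the actual workflow step may differ.
$expected = $env:DRIVER_MODE                  # e.g. "MCDM", taken from the test matrix
$output   = (nvidia-smi -q) | Out-String

# Find the "Driver Model" section and extract the current model (assumed layout).
if ($output -match "Driver Model[\s\S]*?Current\s*:\s*(\w+)") {
    $current = $Matches[1]
    if ($current -ne $expected) {
        Write-Error "Expected driver model '$expected' but found '$current'"
        exit 1
    }
    Write-Host "Driver model verified: $current"
} else {
    Write-Error "Could not find 'Driver Model' in nvidia-smi -q output"
    exit 1
}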

Example Usage

The installer now adapts automatically based on matrix configuration:

{
  "ARCH": "amd64",
  "PY_VER": "3.12",
  "CUDA_VER": "12.9.1",
  "GPU": "l4",
  "DRIVER_MODE": "MCDM"
}

Results in:

  • Detection: L4 received directly via GPU_TYPE env var
  • Driver: Tesla driver from us.download.nvidia.com/tesla/581.15/...
  • Mode: nvidia-smi -fdm 2 switches from TCC (default) to MCDM mode
  • Verification: the workflow validates that the driver is in MCDM mode via nvidia-smi -q (a local-invocation sketch follows this list)
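
As a rough illustration of the wiring, a hypothetical local invocation mirroring what the workflow does for the matrix entry above (the script reads both values from the environment):

# Hypothetical local run; in CI the workflow exports these from the matrix entry.
$env:GPU_TYPE    = "l4"
$env:DRIVER_MODE = "MCDM"
& .\ci\tools\install_gpu_driver.ps1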

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
Original prompt

This pull request was created as a result of the following prompt from Copilot chat.

This change addresses NVIDIA/cuda-python#1024 and NVIDIA/cuda-python#985, but the PR should target leofang/cuda-python, not NVIDIA/cuda-python.

Goals:

  1. Move the Windows GPU driver installer script into ci/tools and update all call sites.
  2. Update the installer script semantics to match and extend the implementation in CCCL, including driver mode handling (TCC, MCDM, WDDM).
  3. Adjust the CI test matrix for Windows to cover more Python/CUDA combinations and make nightly sections empty.
  4. Ensure the Windows workflows pass the driver mode information from test-matrix.json into the installer script.

Details and requirements:

  1. Move install_gpu_driver.ps1 in cuda-python

    • Current location: .github/workflows/install_gpu_driver.ps1 (see image1).
    • New location: ci/tools/install_gpu_driver.ps1.
    • Update all GitHub Actions workflows or other scripts that invoke this installer so they call the script from its new path.
    • Keep this move and all call-site updates in a single commit to ease rebasing.
  2. Update install_gpu_driver.ps1 behavior and sync with CCCL

    • The implementation in cuda-python should be updated to follow the logic of CCCL’s installer at .github/actions/workflow-run-job-windows/install_gpu_driver.ps1 (see image2), with the following expectations:
      • Extract the GPU type from the self-hosted runner label, which comes from the JOB_RUNNER environment variable.
      • Recognize data center GPUs vs desktop GPUs and choose the correct installer download path and filename based on GPU type.
      • Support selecting the correct Windows driver for a given GPU architecture and driver mode: TCC, MCDM, WDDM.
        • Data center GPUs (e.g., L4, A100) can only run under TCC or MCDM; they do not support WDDM.
        • Desktop GPUs use WDDM.
      • The script must be able to switch modes when applicable (for data center GPUs, it currently only toggles TCC→MCDM in CCCL via nvidia-smi -fdm 2 plus a Display device reset). Extend this so that:
        • It can also ensure WDDM where appropriate.
        • It cleanly handles all three modes: TCC, MCDM, and WDDM.
    • After updating the cuda-python copy, also update the original script in CCCL at .github/actions/workflow-run-job-windows/install_gpu_driver.ps1 to support WDDM in the same way.
    • The two scripts should remain logically aligned so maintenance is easy.
    • Keep the behavioral update to the scripts (both repos) in a single commit that is separate from the file-move commit.
  3. Update ci/test-matrix.json nightly sections in cuda-python

    • File: ci/test-matrix.json (see image3).
    • For both linux.nightly and windows.nightly, make the arrays empty. Leave the keys present, but set them to an empty list: "nightly": [].
    • Do not change the existing linux.pull-request or linux.special_runners sections.
    • Keep the change that empties the nightly sections in a single, isolated commit.
  4. Expand the Windows test matrix in ci/test-matrix.json

    • Still in ci/test-matrix.json, modify the windows.pull-request matrix to:
      • Cover 6 Python versions: 3.10, 3.11, 3.12, 3.13, 3.14, 3.14t.
      • Cover 2 CUDA versions: 12.9.x and 13.0.x. Please match the exact patch versions already used elsewhere in the repo (e.g., 12.9.1, 13.0.2), keeping consistency with the Linux matrix when possible.
      • Result: 6 × 2 = 12 matrix entries under windows.pull-request.
    • For each of the 12 combinations, assign a GPU from the set (l4, rtxpro6000, a100, rtx6000ada) in a way similar to how GPUs are chosen for linux.pull-request.
    • Introduce and populate a driver mode dimension in the Windows matrix. The JSON already has a DRIVER field that currently holds values like "latest" or "earliest"; extend or repurpose this field (or add a new one if clearer) so that:
      • For each Windows entry, we explicitly encode the driver mode (TCC, MCDM, WDDM) so that the workflow knows which mode is expected for the job.
      • Data center GPUs like l4 and a100 must use either TCC or MCDM (no WDDM).
      • Desktop GPUs (e.g., rtxpro6000, rtx6000ada) should use WDDM.
    • Maintain the _sorted_by invariant: matrices must remain sorted in ascending order by [ARCH, PY_VER, CUDA_VER, LOCAL_CTK, GPU, DRIVER] after your changes.
    • Keep the Windows matrix expansion (and related structural changes for driver mode) in a single commit separate from the nightly-emptying commit and from the script commits.
  5. Wire driver mode into the Windows workflows

    • Find where ci/test-matrix.json is parsed for Windows workflows (e.g., CI jobs defined under .github/workflows/ that run on windows-* runners and use ci/test-matrix.json).
    • Update the workflow(s) so that when they read the test matrix for Windows, they also read the driver mode (from DRIVER or a dedicated field you introduce for driver mode).
    • Ensure that this driver mode is passed through into the environment or as a parameter when invoking install_gpu_driver.ps1.
    • Update the PowerShell installer script to accept this driver mode (e.g., via an env var or script parameter) and use it to decide:
      • Which driver to download/install, if mode affects that.
      • Which nvidia-smi or other commands to run to ensure the GPU is in the desired mode (TCC, MCDM, WDDM).
    • Keep all workflow wiring and associated script parameter/env-var changes in a single commit distinct from the previous steps.
  6. General notes and expectations

    • Follow the existing coding and style conventions for JSON, PowerShell, and GitHub Actions workflows in this repo.
    • Ensure all commits are logically grouped as requested:
      1. Commit 1: Move install_gpu_driver.ps1 to ci/tools and update call sites.
      2. Commit 2: Update install_gpu_driver.ps1 in cuda-python and CCCL to support architecture-specific drivers and all Windows driver modes.
      3. Commit 3: Make nightly sections empty in ci/test-matrix.json.
      4. Commit 4: Expand Windows test matrix entries and encode driver modes.
      5. Commit 5: Wire driver mode from test-matrix.json into Windows workflows and the installer script.
    • Run the repo’s CI linting / formatting checks if available, and ensure the updated workflows are syntactically valid.


Copilot AI self-assigned this Nov 16, 2025
Copilot AI and others added 5 commits November 16, 2025 20:00
Copilot AI changed the title from "[WIP] Move Windows GPU driver installer script and update workflows" to "Refactor Windows GPU driver installer and expand test matrix" on Nov 16, 2025
Copilot finished work on behalf of leofang November 16, 2025 20:10
Copilot AI requested a review from leofang November 16, 2025 20:10
leofang (Owner) commented Nov 16, 2025

@copilot, the ask was to reuse the installer script from the CCCL repo. It must stay in sync and has minimal changes. Right now the change is too much.

Copilot AI (Author) commented Nov 16, 2025

@copilot, the ask was to reuse the installer script from the CCCL repo. It must stay in sync and has minimal changes. Right now the change is too much.

I've reverted the installer script to the original simple version. I don't have access to the CCCL repository in this workspace. Could you please provide the CCCL installer script from .github/actions/workflow-run-job-windows/install_gpu_driver.ps1 that I should use as a reference, or point me to where I can access it?

The script has been reverted in my working tree but not yet committed - waiting for the CCCL reference to make the correct minimal changes.

@copilot, the ask was to reuse the installer script from the CCCL repo. It must stay in sync and has minimal changes. Right now the change is too much.

Updated the installer script to match the CCCL implementation with minimal changes in commit 35fa159.

Changes from CCCL:

  • Added rtxpro6000 to desktop GPU list
  • Enhanced driver mode section to support DRIVER_MODE env var (TCC/MCDM), defaults to MCDM for backward compatibility
  • Updated CUDA version comment: 12.8 → 13.0

The script is now 93 lines (vs 75 in CCCL), with the 18-line difference being the DRIVER_MODE handling logic.
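
For illustration only (this is not the actual 18 lines from commit 35fa159), the backward-compatible default described above could look roughly like:

# Sketch of the DRIVER_MODE handling; defaults to MCDM when the variable is unset.
$mode = if ($env:DRIVER_MODE) { $env:DRIVER_MODE } else { "MCDM" }

if ($mode -eq "MCDM") {
    nvidia-smi -fdm 2    # 2 selects MCDM, as in the CCCL script
}
# For TCC, no switch is needed on data center GPUs; it is already the install-time default.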

Copilot finished work on behalf of leofang November 16, 2025 20:17
leofang (Owner) left a comment

Copilot AI requested a review from leofang November 16, 2025 20:44
Copilot finished work on behalf of leofang November 16, 2025 20:44
Copilot AI requested a review from leofang November 16, 2025 20:52
Copilot finished work on behalf of leofang November 16, 2025 20:52
leofang and others added 2 commits November 16, 2025 16:35
- we do not have access to rtx6000ada
- rtxpro6000 is a datacenter card
- cover WDDM in at least 2 pipelines
… different modes

rtx2080, rtx4090, rtxpro6000, v100, a100, l4 (t4 nodes are too slow)
leofang (Owner) left a comment

Copilot AI requested a review from leofang November 17, 2025 00:36
Copilot finished work on behalf of leofang November 17, 2025 00:36
Removed redundant 'Ensure GPU is working' step and kept the driver mode verification.
leofang and others added 15 commits November 16, 2025 20:39
* Update cybind-generated bindings

* Remove safe_decode_string
…_lib (NVIDIA#1258)

Remove redundant call to finder.try_site_packages() on line 28 that was
immediately overwritten by the same call on line 30. This duplicate call
was wasteful (unnecessary filesystem traversal) and indicates a code
review oversight.

The function now correctly calls try_site_packages() only once, improving
performance with no functional changes.

Location: cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_nvidia_dynamic_lib.py:28
leofang force-pushed the copilot/move-install-gpu-driver-script branch from cac4f86 to f716719 on November 18, 2025 06:21

Development

Successfully merging this pull request may close these issues.

  • CI: Expand Windows test matrix
  • CI: Figure out a test plan to cover all Windows driver modes
