Modify CI to enable tests on CUDA 13.1 and add Dockerfiles for CUDA 13.1 #21564
kswiecicki merged 1 commit into sycl
Conversation
LGTM. The failing tests from the SYCL pre-commit / CUDA 13.1 run will be disabled in a separate PR, with appropriate issue trackers.
```yaml
      target_devices: cuda:gpu
  - name: NVIDIA/CUDA 13.1
    runner: '["Linux", "cuda13"]'
    image: "ghcr.io/intel/llvm/ubuntu2404_intel_drivers_cuda131:latest"
```
Is there a reason to add new containers/workflows instead of replacing the existing ones? I would prefer to avoid the new complexity if possible, thanks.
Old containers will not work because CUDA 13+ requires a new toolkit (installed in the container) and matching drivers (installed on the host machine).
I would rather we update the host machine drivers and use the new toolkit than add a new container. I can update the drivers if you agree with this approach.
Your idea only works for the UR tests, but for SYCL we can't run the jobs without a container: if we don't define one, the workflow uses the default intel/llvm/ubuntu2404_intel_drivers:latest image and the job fails: https://github.com/intel/llvm/actions/runs/24386352355/job/71224477448#step:21:27
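For context, here is a minimal sketch of how a matrix entry pins a non-default container image in the workflow; the key names mirror the diff quoted earlier, but the surrounding matrix structure is an assumption, not the PR's exact layout:

```yaml
# Hypothetical matrix fragment: jobs that omit `image` fall back to the
# default intel/llvm/ubuntu2404_intel_drivers:latest container, which lacks
# the CUDA 13 toolkit, so the CUDA 13.1 entry pins its own image.
matrix:
  include:
    - name: NVIDIA/CUDA 13.1
      runner: '["Linux", "cuda13"]'
      target_devices: cuda:gpu
      image: ghcr.io/intel/llvm/ubuntu2404_intel_drivers_cuda131:latest
```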
Sorry, I didn't mean we shouldn't use a container. I was saying we should just update the existing container to the new CUDA toolkit instead of keeping the existing one and adding a second one.

In the CI log, it looks like we get no devices, and almost surely that's because the container is using a toolkit that's too new. So probably the fix is to just update the kernel driver on all our CUDA runners.

Please finalize this PR for merge; I will then update the drivers on our CUDA runners, rerun CI, and merge this PR.

Also, if you have a recommended kernel driver version to use, let me know. Ping me again when I should try updating the driver. Thanks
@sarnex so you’re proposing to update the existing container to CUDA 13 drivers and toolkit?
Currently, some customers are using a CUDA adapter based on version 12.8. If we switch to CUDA 13 or newer, we could miss issues that reproduce only on that version. Also, this requires passing -DUR_CONFORMANCE_NVIDIA_ARCH="sm_75" to all CUDA configurations, or setting it as the default, since the current default sm_50 would not work (see https://github.com/intel/llvm/pull/21564/changes#diff-a718eeef19a58e1d7ae466d96cf2a94859c8e2cb4739a335d330406a9b22a748R132).
My suggestion is to continue testing both versions for a period of time and, at a later stage, gradually drop support for CUDA 12.x.
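As an illustration of the flag mentioned above, the configure step would look roughly like this; the source/build paths are assumptions, not the PR's actual invocation:

```shell
# Hypothetical configure step for the UR conformance tests against CUDA 13.
# CUDA 13 drops offline compilation for architectures below sm_75 (Turing),
# so the previous sm_50 default no longer builds.
cmake -S . -B build \
  -DUR_CONFORMANCE_NVIDIA_ARCH="sm_75"
```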
Ok I was missing that context, thanks
Let me know when this PR is ready for review with the new container kept (it doesn't seem finalized right now, e.g. some commented-out code).
@sarnex Hi, this PR is ready for re-review, thanks

Sorry for the slow response, was really busy yesterday. Will look now
sarnex left a comment
Ideally I'd like to avoid the duplication between the images, but I don't have any better ideas.
```yaml
      target_devices: cuda:gpu
  - name: NVIDIA/CUDA 13.1
    runner: '["Linux", "cuda13"]'
    image: "ghcr.io/intel/llvm/ubuntu2404_intel_drivers_cuda131:latest"
```
Nit: I don't think we need quotes around the image name.
CI failures are unrelated to the GH workflow changes in this PR.
No description provided.