Allow CUDA PTX forward compatibility #5527

krasznaa · 2022-10-06T14:47:15Z

Fixed the logic for building Kokkos for an older architecture. With the current logic one must use the same major architecture version in all circumstances, which is just not practical, as it can result in nonsensical errors like the following:

Kokkos::Cuda::initialize ERROR: running kernels compiled for compute capability 5.2 on device with compute capability 7.5 is not supported by CUDA!
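For readers who want the compatibility rule spelled out, here is a minimal sketch of the distinction this PR is about. It is not the Kokkos implementation: `ComputeCapability`, `ArchMatch` and `classify` are hypothetical names made up for illustration. It assumes the usual CUDA rules, namely that a binary built for a given compute capability runs natively on devices of the same major version with an equal or higher minor version, and that embedded PTX can be JIT-compiled for any device whose compute capability is at least the one compiled for.

```cpp
// Hypothetical illustration only; none of these names exist in Kokkos.
struct ComputeCapability {
  int major_cc;  // e.g. 7 for compute capability 7.5
  int minor_cc;  // e.g. 5 for compute capability 7.5
};

enum class ArchMatch {
  Native,       // same major version, device minor >= compiled minor
  NeedsPtxJit,  // newer major version: only works if PTX was embedded
  Unsupported   // device is older than the architecture compiled for
};

inline ArchMatch classify(ComputeCapability compiled, ComputeCapability device) {
  const int compiled_cc = compiled.major_cc * 10 + compiled.minor_cc;
  const int device_cc = device.major_cc * 10 + device.minor_cc;

  // Code built for a newer architecture cannot run on an older device.
  if (device_cc < compiled_cc) return ArchMatch::Unsupported;

  // Within one major version the compiled binary is usable directly.
  if (device.major_cc == compiled.major_cc) return ArchMatch::Native;

  // Across major versions the driver has to JIT-compile the embedded PTX.
  return ArchMatch::NeedsPtxJit;
}
```

With this sketch, the case from the error message above, `classify({5, 2}, {7, 5})`, yields `ArchMatch::NeedsPtxJit` rather than a hard failure, while `classify({7, 5}, {7, 0})` is still `ArchMatch::Unsupported`.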

Pinging @Yhatoh and @stephenswat.


dalg24-jenkins · 2022-10-06T14:47:17Z

Can one of the admins verify this patch?

masterleinad · 2022-10-06T14:54:29Z

There is already a discussion about this in #5187.

krasznaa · 2022-10-06T20:21:13Z

I don't understand though. 😕 Which compiler is not meant to allow building CUDA code for an older architecture and then run that on a GPU supporting a newer architecture? Why would you explicitly prevent nvcc from including PTX in the binary that it produces?

NVIDIA was (rightfully) proudly proclaiming at the last GTC how one could still run applications built in the past (with old CUDA versions) for compute capability 1.0, on Hopper.

So what is Kokkos trying to prevent here exactly? I really fail to see the necessity for this check. 😕

crtrott · 2022-10-07T05:20:32Z

There are a couple of issues. One was that I misunderstood whether we were embedding PTX: if you don't, you can't actually do this. But it turns out we do. The other is that there can be hidden issues. It turns out, for example, that one of our tests hangs if you compile for pre-Volta and run on Volta or newer. This is absolutely reproducible and deterministic, i.e. it will always hang if you compile for pre-Volta and run on Volta, and it will always pass if you run the same thing on Kepler, or if you compile for Volta and run on Volta. Now it turns out that I figured out how to fix it: there needs to be a __syncwarp on pre-Volta, because you don't have independent forward progress. On pre-Volta you get the syncwarp implicitly; on Volta or newer, part of the generated code apparently assumes that you are synced (because you compiled for pre-Volta), but in actuality you aren't, and you get a deadlock around a work-sharing construct.

Another side issue is that we make host-side decisions based on Volta vs. pre-Volta, largely connected to overload sets of atomics. We redesigned some of that recently so that this problem goes away, however. (The issue is a bit complicated and has to do with overload set matching requirements for host and device functions, which means you need to have a specific double overload if you want it on the device based on the device-side CUDA_ARCH.)

Long story short: there are weird corner cases which are hard to detect and make your code fail if there is an architecture mismatch even though it works perfectly fine on each arch if you actually compile for it.

Good news is: it turns out we inadvertently fixed some of these recently with some code redesign, and I was able to fix the hang too. Which means it actually passes testing with a mismatch now. So we decided to enable this. Note that depending on what you compile for, the performance penalty is gonna be pretty big. Though I would assume it's not too bad if you, say, compile for 7.0 and simply use stuff newer than that.

crtrott · 2022-10-07T22:41:48Z

Requires #5536

j8asic · 2022-10-10T14:59:43Z

If this gets merged you can also close #3612, which had the same solution two years ago.

dalg24 · 2022-10-11T14:23:31Z

OK to test

@crtrott Do not abort but raise a warning though?

krasznaa · 2022-10-11T14:29:15Z

There is already a warning on lines 752-753. Which also very much put this error into question when I first looked at the code...

Fixed the logic for building Kokkos for an older architecture.

0ef177c

With the current logic one must use the same major architecture version in all circumstances, which is just not practical.

crtrott mentioned this pull request Oct 7, 2022

CUDA: fixes mixed-arch-use of WorkGraphPolicy #5536

Merged

crtrott approved these changes Oct 7, 2022


masterleinad mentioned this pull request Oct 10, 2022

Allow CUDA PTX forward compatibility if using Clang #5187

Closed

masterleinad linked an issue Oct 10, 2022 that may be closed by this pull request

Running older SM kernels on newer archs results in Kokkos::abort #3612

Closed

masterleinad approved these changes Oct 11, 2022


dalg24 changed the title ~~CUDA Init Fix, develop branch (2022.10.06.)~~ Allow CUDA PTX forward compatibility Oct 12, 2022

dalg24 merged commit 04de99c into kokkos:develop Oct 12, 2022

dalg24 mentioned this pull request Oct 12, 2022

CHANGELOG: 4.0 #5439

Closed

masterleinad mentioned this pull request Dec 16, 2022

cuda: Remove hard error check on compute compatibility #5702

Closed

crtrott added the Patch Release label Dec 16, 2022
