
Allow CUDA PTX forward compatibility #5527

Merged
merged 1 commit into from
Oct 12, 2022

Conversation

krasznaa
Contributor

@krasznaa krasznaa commented Oct 6, 2022

Fixed the logic for building Kokkos for an older architecture. With the current logic one must use the same major architecture version in all circumstances, which is just not practical, as it can result in nonsensical errors like the following:

Kokkos::Cuda::initialize ERROR: running kernels compiled for compute capability 5.2 on device with compute capability 7.5 is not supported by CUDA!

Pinging @Yhatoh and @stephenswat.

With the current logic one must use the same major architecture
version in all circumstances, which is just not practical.
@dalg24-jenkins
Collaborator

Can one of the admins verify this patch?

@masterleinad
Contributor

There is already a discussion about this in #5187.

@krasznaa
Contributor Author

krasznaa commented Oct 6, 2022

I don't understand though. 😕 Which compiler is supposed to disallow building CUDA code for an older architecture and then running it on a GPU supporting a newer architecture? Why would you explicitly prevent nvcc from including PTX in the binary that it produces?

NVIDIA (rightfully) proudly proclaimed at the last GTC that one can still run applications built in the past (with old CUDA versions) for compute capability 1.0 on Hopper.

So what is Kokkos trying to prevent here exactly? I really fail to see the necessity for this check. 😕
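For context on the PTX embedding being discussed: whether a binary can be JIT-compiled forward to a newer GPU depends on the `-gencode` flags passed to nvcc. An illustrative (not taken from this PR) pair of command lines, assuming a compute capability 5.2 build target:

```
# SASS only: runs on sm_52 hardware, cannot be JIT'ed forward to sm_75
nvcc -gencode arch=compute_52,code=sm_52 kernel.cu

# SASS plus embedded PTX: the PTX can be JIT-compiled by the driver
# for a newer device such as sm_75
nvcc -gencode arch=compute_52,code=[sm_52,compute_52] kernel.cu
```

With `code=compute_52` in the fatbinary, the CUDA driver can JIT the PTX for any device of compute capability 5.2 or newer, which is the forward compatibility this PR relies on.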

@crtrott
Member

crtrott commented Oct 7, 2022

There are a couple of issues. One was that I misunderstood whether we were embedding PTX — if you don't, you can't actually do this. But it turns out we do. The other is that there can be hidden issues. It turns out, for example, that one of our tests hangs if you compile for pre-Volta and run on Volta or newer. This is absolutely reproducible and deterministic: it will always hang if you compile for pre-Volta and run on Volta, and it will always pass if you run the same thing on Kepler, or if you compile for Volta and run on Volta.

It turns out I figured out how to fix it: there needs to be a __syncwarp on pre-Volta, because you don't have independent forward progress. On pre-Volta you get the warp sync implicitly; on Volta or newer, part of the generated code apparently assumes that the warp is synced (because you compiled for pre-Volta), but in actuality it isn't, and you get a deadlock around a work-sharing construct.

Another side issue is that we make host-side decisions based on Volta vs. pre-Volta, largely connected to overload sets of atomics. We redesigned some of that recently, however, so that problem goes away. (The issue is a bit complicated and has to do with overload-set matching requirements for host and device functions, which means you need a specific double overload if you want it on the device, based on the device-side __CUDA_ARCH__.)

Long story short: there are weird corner cases that are hard to detect and that make your code fail when there is an architecture mismatch, even though it works perfectly fine on each architecture if you actually compile for it.
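The warp-sync hazard described above can be illustrated with a minimal sketch. This is not the actual Kokkos test; the function and its names are hypothetical, showing only the general pattern of code that relied on pre-Volta implicit lockstep execution and deadlocks (or reads stale data) under Volta's independent thread scheduling unless `__syncwarp()` is added:

```cuda
// Hypothetical warp-level reduction sketch, not Kokkos source code.
// On sm_3x/5x the 32 lanes of a warp execute in lockstep, so each step
// implicitly observes the other lanes' writes. On sm_70+ threads in a
// warp may diverge, so explicit __syncwarp() calls are required before
// reading another lane's slot.
__device__ int warp_sum(int val, volatile int* scratch, int lane) {
  scratch[lane] = val;
  for (int offset = 16; offset > 0; offset /= 2) {
    __syncwarp();  // needed on Volta and newer; harmless on pre-Volta
    if (lane < offset) scratch[lane] += scratch[lane + offset];
  }
  __syncwarp();
  return scratch[0];
}
```

Crucially for this PR, code compiled *for* pre-Volta but JIT'ed onto Volta hits exactly this gap: the generated PTX omits the sync because the target guaranteed lockstep, but the executing hardware does not.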

Good news is: it turns out we fixed some of these inadvertently with a recent code redesign, and I was able to fix the hang too. That means it actually passes testing with a mismatch now, so we decided to enable this. Note that depending on what you compile for, the performance penalty can be pretty big. Though I would assume it's not too bad if you, say, compile for 7.0 and simply run on something newer than that.

@crtrott
Member

crtrott commented Oct 7, 2022

Requires #5536

@j8asic
Contributor

j8asic commented Oct 10, 2022

If this gets merged, you can also close #3612, which proposed the same solution two years ago.

@masterleinad masterleinad linked an issue Oct 10, 2022 that may be closed by this pull request
@dalg24
Member

dalg24 commented Oct 11, 2022

OK to test

@crtrott Do not abort but raise a warning though?

@krasznaa
Contributor Author

There is already a warning on lines 752-753, which also very much put this error into question when I first looked at the code...

@dalg24 dalg24 changed the title CUDA Init Fix, develop branch (2022.10.06.) Allow CUDA PTX forward compatibility Oct 12, 2022
@dalg24 dalg24 merged commit 04de99c into kokkos:develop Oct 12, 2022
@dalg24 dalg24 mentioned this pull request Oct 12, 2022
Successfully merging this pull request may close these issues.

Running older SM kernels on newer archs results in Kokkos::abort