-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow CUDA PTX forward compatibility #5527
Conversation
With the current logic one must use the same major architecture version in all circumstances, which is just not practical.
Can one of the admins verify this patch? |
There is already a discussion about this in #5187. |
I don't understand though. 😕 Which compiler is not meant to allow building CUDA code for an older architecture and then run that on a GPU supporting a newer architecture? Why would you explicitly prevent NVIDIA was (rightfully) proudly proclaiming at the last GTC how one could still run applications built in the past (with old CUDA versions) for compute capability 1.0, on Hopper. So what is Kokkos trying to prevent here exactly? I really fail to see the necessity for this check. 😕 |
There is a couple issues: one was that I misunderstood whether we were embedding ptx - if you don't you can't actually do this. But it turns we do. The other is that there can be hidden issues. It turns out for example that one of our tests hangs if you compile for Pre-Volta and run on Volta or newer. This is absolutely reproducable and determinstic. I.e. it will always hang if you compile for pre-volta and run on volta, and it will always pass if you run the same thing on kepler, or you compile for volta and run on volta. Now it turns out that I figure out how to fix it (there needs to be __syncwarp on pre-volta, because you don't have independent forward progress. Now on pre-volta you get syncwarp implicitly, on volta or newer part of the generated code apparently assumes that you are synced (because you compiled for pre-volta) but in actuallity it didn't and you get some dead-lock around a work sharing construct. Another side issue is that we make host-side decisions based on Volta vs pre-Volta largely connected to overload sets of atomics. We redesigned some of that recently so that that problem goes away however (the issue is a bit complicated and has to do with overload set matching requirements for host and device functions, which means you need to have specific double overlaod if you wanted it on the device based on the device side CUDA_ARCH). Long story short: there are weird corner cases which are hard to detect and make your code fail if there is an architecture mismatch even though it works perfectly fine on each arch if you actually compile for it. Good news is: turns out we fixed some inadvertently recently with some code redesign, and I was able to fix the hang too. Which means it actually passed testing with a mismatch now. So we decided to enable this. Note that depending on what you compile for the performance penalty is gonna be pretty big. Though I would assume its not too bad if you say compile for 7.0 and simply use stuff newer than that. |
Requires #5536 |
If this gets merged you can also close #3612 that had the same solution two years ago |
OK to test @crtrott Do not abort but raise a warning though? |
There is already a warning on lines 752-753. Which also very much put this error into question when I first looked at the code... |
Fixed the logic for building Kokkos for an older architecture. With the current logic one must use the same major architecture version in all circumstances, which is just not practical. As it can result in the following kind of nonsensical errors:
Pinging @Yhatoh and @stephenswat.