CUDA: Make block, thread, and warp indices unsigned. #6112
base: main
Conversation
This aligns with their signedness in CUDA C/C++, and results in slightly more efficient code that uses fewer instructions in some cases, due to the reduction in operations related to signedness. For example, the reduction kernels become more efficient with this commit, which increased their tendency to race (fixed in the previous commit).
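As a rough illustration of the index arithmetic in question (a plain NumPy sketch, not Numba's actual lowering; the variable names mirror the PTX special registers), a global thread index built from unsigned 32-bit components stays in 32-bit unsigned arithmetic, whereas signed 64-bit indices force each component to be widened first:

```python
import numpy as np

# Sketch only: these values stand in for the CUDA special registers
# %tid.x (thread index), %ntid.x (block dim), and %ctaid.x (block index),
# which are unsigned 32-bit values in CUDA C/C++.
tid = np.uint32(7)
ntid = np.uint32(128)
ctaid = np.uint32(3)

# With unsigned 32-bit components, the global index is a single 32-bit
# multiply-add and stays uint32.
i_u32 = ctaid * ntid + tid

# With signed 64-bit indices, each component is widened before the
# arithmetic, adding conversion operations.
i_s64 = np.int64(ctaid) * np.int64(ntid) + np.int64(tid)

assert i_u32.dtype == np.uint32
assert int(i_u32) == int(i_s64) == 391
```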
The `syncwarp()` function used in the reduction kernels was introduced with CUDA 9.0. Rather than trying to make the availability of reduction kernels conditional on the CUDA Toolkit version, it seems more sensible to just drop CUDA 8.0 now that it is quite old. This commit also includes a fix for the logic setting `SUPPORTED_CC` in `nvvm.py` (when there was no suitable toolkit, it was creating a tuple containing the empty tuple) and adds an error when an attempt is made to use NVVM when no supported toolkit is found.
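A hedged sketch of the shape of that fix (this is not Numba's actual `nvvm.py` code; the function name, minimum CC, and error type are all illustrative): the idea is to raise a clear error when nothing suitable is found, rather than returning a malformed value like `((),)`:

```python
def get_supported_ccs(toolkit_ccs):
    """Illustrative only -- not Numba's real implementation.

    Given the compute capabilities a toolkit can target, return the
    supported ones as a tuple, raising instead of silently producing
    a tuple containing the empty tuple when nothing suitable exists.
    """
    MIN_CC = (3, 0)  # hypothetical minimum; CC tuples compare lexicographically
    supported = tuple(cc for cc in toolkit_ccs if cc >= MIN_CC)
    if not supported:
        raise RuntimeError("No supported CUDA toolkit / NVVM found")
    return supported
```

For example, `get_supported_ccs([(3, 5), (7, 0)])` returns `((3, 5), (7, 0))`, while `get_supported_ccs([])` raises instead of returning `((),)`.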
Using a partial mask with `syncwarp()` is only supported on CC 7.0 and greater architectures. This commit switches the implementation to always use a full mask, leaving the use of a partial mask on CC 7.0 and above as a potential future optimization.
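To illustrate the masks involved (plain Python; `partial_mask` is a hypothetical helper for illustration, not part of Numba's API): a warp has 32 lanes and a sync mask carries one bit per lane, so the full mask is `0xFFFFFFFF`, while a partial mask selects only the participating lanes:

```python
# One bit per lane of a 32-lane warp; a full mask is accepted on all
# architectures that support syncwarp().
FULL_MASK = 0xFFFFFFFF

def partial_mask(n_active):
    """Mask covering the first n_active lanes (hypothetical helper).

    Passing a partial mask to syncwarp() is only valid on CC 7.0+
    devices, which is why this PR always uses the full mask for now.
    """
    return (1 << n_active) - 1

assert partial_mask(32) == FULL_MASK
assert partial_mask(16) == 0x0000FFFF  # lower half-warp only
```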
When I read the description, my first thought is that it might increase the number of unnecessary casts to
@sklam is the avoidance of casts to
So far, comparing PTX before and after this patch (with a subset of the test suite), I haven't seen any instances where typing is casting to floats, but I suspect that signed / unsigned comparisons are forcing a cast to float because there are only lowering implementations for signed vs. signed and unsigned vs. unsigned. Looking into this a bit now...
BTW @sklam - could you do a buildfarm run of this please, in case it turns up any other latent issues? This feels like the sort of thing that will surprise me on different OSes, devices, and toolkit versions (though I did endeavour to test with various toolkit versions).
The cast to float is only going to be a problem for
I've broken out the straightforward parts of this PR (everything but changing the signedness of indices) into a separate PR so that they can be reviewed / merged independently: #6127.
In the meantime for this PR, I'm looking into an acceptable way to avoid casting to floats for the signed / unsigned comparison that occurs when comparing a value based on one of these indices with e.g. the dimension of an array.
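NumPy's promotion rules illustrate the same problem, and can be checked quickly on the CPU: there is no integer type that can hold every `uint64` and every `int64`, so their common type falls back to `float64`, whereas an unsigned 32-bit index compared against a signed 64-bit dimension has an exact integer common type:

```python
import numpy as np

# uint64 vs int64: no integer type covers both ranges, so the common
# type is float64 (lossy above 2**53).
assert np.promote_types(np.uint64, np.int64) == np.float64

# uint32 vs int64: int64 can represent every uint32 value, so the
# comparison can stay in integer arithmetic.
assert np.promote_types(np.uint32, np.int64) == np.int64
```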
NOTE: To reviewers, this is dropping CUDA Toolkit 8.0 support. |
Following discussion in the triage meeting, I'll look into replacing the unsigned types for these indices with an index type to see if that alleviates some of the complication of typing and casts to float. |
gpuci run tests |
gpuci run tests |
The issue with the current revision of this branch can be seen by running:

where the PTX looks like:

```
mov.u32         %r2, %tid.x;
mov.u32         %r3, %ctaid.x;
mov.u32         %r4, %ntid.x;
mad.lo.s32      %r1, %r4, %r3, %r2;     // compute the global thread index
cvt.rn.f64.u32  %fd1, %r1;              // unsigned index converted to float64
cvt.rn.f64.s64  %fd2, %rd7;             // signed bound converted to float64
setp.geu.f64    %p1, %fd1, %fd2;        // comparison performed in float64
@%p1 bra        $L__BB0_2;
```
The latest commit seems to help with avoiding float64 generation. Plans for forward progress are:
```diff
diff --git a/numba/cuda/codegen.py b/numba/cuda/codegen.py
index 47c6d36a7..6b75fc49c 100644
--- a/numba/cuda/codegen.py
+++ b/numba/cuda/codegen.py
@@ -202,6 +202,7 @@ class CUDACodeLibrary(serialize.ReduceMixin, CodeLibrary):
         # Load
         cufunc = module.get_function(self._entry_name)
+        print(f"\nATTRIBUTES : {self._entry_name} : {cufunc.attrs}\n")
         # Populate caches
         self._cufunc_cache[device.id] = cufunc
```
Summary of a little more analysis:
Some scripts / notes for analysis are also added:
It would be nice to avoid any unwanted typing changes at all, so I intend to try and refine this a bit further still.
This pull request is marked as stale as it has had no activity in the past 3 months. Please respond to this comment if you're still interested in working on this. Many thanks! |
I still plan to work on this. |
I'm still going to get round to this one day... |
I am still interested in finishing this off, when everything aligns to give me a chance of fixing it. |
Still on the to-do list for one day... |
One day... |
EDIT: All the changes in this PR except for actually making the indices unsigned are in PR #6127 for a separate, simpler review.
Original comments:
The main purpose of this PR is to make thread, block, and warp indices unsigned, in line with CUDA C / C++. This results in slightly
more efficient code using fewer instructions in some cases due to the reduction in operations relating to signedness.
There are several other required changes to support this, bundled into this PR:

- Use of `syncwarp()` at appropriate points in the reduction kernel. Originally I used a partial mask containing only the threads currently participating in the reduction, but this is only supported on CC 7.0 and above, so I've left it as a potential future optimization to be applied only on CC >= 7.0 devices.
- A `syncwarp()` function with no arguments, which assumes a full mask. This was not essential, but provides a nicer interface than always having to write `syncwarp(0xFFFFFFFF)`. Some tests for `syncwarp()` are also added.
- Dropping support for CUDA 8.0: `syncwarp()` etc. were only introduced in CUDA 9.0, and dropping CUDA 8.0 seems reasonable at this point in time.
- A fix to the logic in `nvvm.py` that checks the supported compute capabilities, and the addition of an error when one attempts to use an unsupported NVVM version, alongside a warning at import time for `numba.cuda` that fires if an unsupported version is detected.

For more detailed comments, see individual commit messages.