CUDA: Make block, thread, and warp indices unsigned. #6112
base: main
Conversation
This aligns with their signedness in CUDA C/C++, and results in slightly more efficient code that uses fewer instructions in some cases, due to the reduction in operations related to signedness. For example, the reduction kernels become more efficient with this commit, which increased their tendency to race (fixed in the previous commit).
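As a rough illustration of the index arithmetic in question (a plain NumPy sketch, not Numba's actual lowering; the variable names mirror the PTX special registers), a global thread index built from unsigned 32-bit components stays in 32-bit unsigned arithmetic, whereas signed 64-bit indices force each component to be widened first:

```python
import numpy as np

# Sketch only: these values stand in for the CUDA special registers
# %tid.x (thread index), %ntid.x (block dim), and %ctaid.x (block index),
# which are unsigned 32-bit values in CUDA C/C++.
tid = np.uint32(7)
ntid = np.uint32(128)
ctaid = np.uint32(3)

# With unsigned 32-bit components, the global index is a single 32-bit
# multiply-add and stays uint32.
i_u32 = ctaid * ntid + tid

# With signed 64-bit indices, each component is widened before the
# arithmetic, adding conversion operations.
i_s64 = np.int64(ctaid) * np.int64(ntid) + np.int64(tid)

assert i_u32.dtype == np.uint32
assert int(i_u32) == int(i_s64) == 391
```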
The `syncwarp()` function used in the reduction kernels was introduced with CUDA 9.0. Rather than trying to make the availability of reduction kernels conditional on the CUDA Toolkit version, it seems more sensible to just drop CUDA 8.0 now that it is quite old. This commit also includes a fix for the logic setting `SUPPORTED_CC` in `nvvm.py` (when there was no suitable toolkit, it was creating a tuple containing the empty tuple) and adds an error when an attempt is made to use NVVM when no supported toolkit is found.
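A hedged sketch of the shape of that fix (this is not Numba's actual `nvvm.py` code; the function name, minimum CC, and error type are all illustrative): the idea is to raise a clear error when nothing suitable is found, rather than returning a malformed value like `((),)`:

```python
def get_supported_ccs(toolkit_ccs):
    """Illustrative only -- not Numba's real implementation.

    Given the compute capabilities a toolkit can target, return the
    supported ones as a tuple, raising instead of silently producing
    a tuple containing the empty tuple when nothing suitable exists.
    """
    MIN_CC = (3, 0)  # hypothetical minimum; CC tuples compare lexicographically
    supported = tuple(cc for cc in toolkit_ccs if cc >= MIN_CC)
    if not supported:
        raise RuntimeError("No supported CUDA toolkit / NVVM found")
    return supported
```

For example, `get_supported_ccs([(3, 5), (7, 0)])` returns `((3, 5), (7, 0))`, while `get_supported_ccs([])` raises instead of returning `((),)`.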
Using a partial mask with `syncwarp()` is only supported on CC 7.0 and greater architectures. This commit switches the implementation to always use a full mask, leaving the use of a partial mask on CC 7.0 and above as a potential future optimization.
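To illustrate the masks involved (plain Python; `partial_mask` is a hypothetical helper for illustration, not part of Numba's API): a warp has 32 lanes and a sync mask carries one bit per lane, so the full mask is `0xFFFFFFFF`, while a partial mask selects only the participating lanes:

```python
# One bit per lane of a 32-lane warp; a full mask is accepted on all
# architectures that support syncwarp().
FULL_MASK = 0xFFFFFFFF

def partial_mask(n_active):
    """Mask covering the first n_active lanes (hypothetical helper).

    Passing a partial mask to syncwarp() is only valid on CC 7.0+
    devices, which is why this PR always uses the full mask for now.
    """
    return (1 << n_active) - 1

assert partial_mask(32) == FULL_MASK
assert partial_mask(16) == 0x0000FFFF  # lower half-warp only
```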
When I read the description, my first thought is that it might increase the number of unnecessary casts to
@sklam is the avoidance of casts to
So far, comparing PTX before and after this patch (with a subset of the test suite), I haven't seen any instances where typing is casting to floats, but I suspect that signed / unsigned comparisons are forcing a cast to float because there are only lowering implementations for signed vs. signed and unsigned vs. unsigned. Looking into this a bit now...
BTW @sklam - could you do a buildfarm run of this please, in case it turns up any other latent issues? This feels like the sort of thing that will surprise me on different OSes, devices, and toolkit versions (though I did endeavour to test with various toolkit versions).
The cast to float is only going to be a problem for
I've broken out the straightforward parts of this PR (everything but changing the signedness of indices) into a separate PR so that they can be reviewed / merged independently: #6127.
In the meantime for this PR, I'm looking into an acceptable way to avoid casting to floats for the signed / unsigned comparison that occurs when comparing a value based on one of these indices with e.g. the dimension of an array.
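NumPy's promotion rules illustrate the same problem, and can be checked quickly on the CPU: there is no integer type that can hold every `uint64` and every `int64`, so their common type falls back to `float64`, whereas an unsigned 32-bit index compared against a signed 64-bit dimension has an exact integer common type:

```python
import numpy as np

# uint64 vs int64: no integer type covers both ranges, so the common
# type is float64 (lossy above 2**53).
assert np.promote_types(np.uint64, np.int64) == np.float64

# uint32 vs int64: int64 can represent every uint32 value, so the
# comparison can stay in integer arithmetic.
assert np.promote_types(np.uint32, np.int64) == np.int64
```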
NOTE: To reviewers, this is dropping CUDA Toolkit 8.0 support. |
Following discussion in the triage meeting, I'll look into replacing the unsigned types for these indices with an index type to see if that alleviates some of the complication of typing and casts to float. |
gpuci run tests |
gpuci run tests |
The issue with the current revision of this branch can be seen by running:

where the PTX looks like:

```
mov.u32         %r2, %tid.x;
mov.u32         %r3, %ctaid.x;
mov.u32         %r4, %ntid.x;
mad.lo.s32      %r1, %r4, %r3, %r2;     // compute the global thread index
cvt.rn.f64.u32  %fd1, %r1;              // unsigned index converted to float64
cvt.rn.f64.s64  %fd2, %rd7;             // signed bound converted to float64
setp.geu.f64    %p1, %fd1, %fd2;        // comparison performed in float64
@%p1 bra        $L__BB0_2;
```
The latest commit seems to help with avoiding float64 generation. Plans for forward progress are:
```diff
diff --git a/numba/cuda/codegen.py b/numba/cuda/codegen.py
index 47c6d36a7..6b75fc49c 100644
--- a/numba/cuda/codegen.py
+++ b/numba/cuda/codegen.py
@@ -202,6 +202,7 @@ class CUDACodeLibrary(serialize.ReduceMixin, CodeLibrary):
         # Load
         cufunc = module.get_function(self._entry_name)
+        print(f"\nATTRIBUTES : {self._entry_name} : {cufunc.attrs}\n")
         # Populate caches
         self._cufunc_cache[device.id] = cufunc
```
Summary of a little more analysis:
Some scripts / notes for analysis are also added:
It would be nice to avoid any unwanted typing changes at all, so I intend to try and refine this a bit further still.
This pull request is marked as stale as it has had no activity in the past 3 months. Please respond to this comment if you're still interested in working on this. Many thanks! |
I still plan to work on this. |
I'm still going to get round to this one day... |
I am still interested in finishing this off, when everything aligns to give me a chance of fixing it. |
Still on the to-do list for one day... |
One day... |
EDIT: All the changes in this PR except for actually making the indices unsigned are in PR #6127 for a separate, simpler review.
Original comments:
The main purpose of this PR is to make thread, block, and warp indices unsigned, in line with CUDA C / C++. This results in slightly
more efficient code using fewer instructions in some cases due to the reduction in operations relating to signedness.
There are several other required changes to support this, bundled into this PR:

- Use of `syncwarp()` at appropriate points in the reduction kernel. Originally I used a partial mask containing only the threads currently participating in the reduction, but this is only supported on CC 7.0 and above, so I've left it as a potential future optimization to be applied only on CC >= 7.0 devices.
- A `syncwarp()` function with no arguments, which assumes a full mask. This was not essential, but provides a nicer interface than always having to write `syncwarp(0xFFFFFFFF)`. Some tests for `syncwarp()` are also added.
- Dropping support for CUDA 8.0: `syncwarp()` etc. were only introduced in CUDA 9.0, and dropping CUDA 8.0 seems reasonable at this point in time.
- A fix to the logic in `nvvm.py` that checks the supported compute capabilities, and the addition of an error when one attempts to use an unsupported NVVM version, alongside a warning at import time for `numba.cuda` that fires if an unsupported version is detected.

For more detailed comments, see individual commit messages.