
CUDA: Make block, thread, and warp indices unsigned. #6112

Open
wants to merge 14 commits into base: main

Conversation

gmarkall
Member

@gmarkall gmarkall commented Aug 11, 2020

EDIT: All the changes in this PR, except for actually making the indices unsigned, are in PR #6127 for a separate, simpler review.

Original comments:

The main purpose of this PR is to make thread, block, and warp indices unsigned, in line with CUDA C/C++. This results in slightly
more efficient code in some cases, using fewer instructions due to the reduction in operations relating to signedness.

There are several other required changes to support this bundled into this PR:

  • Synchronization of warps in reductions. The reduction kernels make an assumption that writes to shared memory by threads in a warp are automatically visible to other threads in the warp. On Volta and beyond, this is no longer true. This is fixed by adding syncwarp() at appropriate points in the reduction kernel. Originally I used a partial mask containing only the threads currently participating in the reduction, but this is only supported on CC 7.0 and above, so I've left it as a potential future optimization to be applied only on CC >= 7.0 devices.
  • Addition of the syncwarp() function with no arguments, which assumes a full mask. This was not essential, but provides a nicer interface than always having to write syncwarp(0xFFFFFFFF). Some tests for syncwarp() are also added.
  • Require CUDA 9.0. syncwarp() etc. were only introduced in CUDA 9.0, and dropping CUDA 8.0 seems reasonable at this point in time.
  • Small fixes to the code in nvvm.py that checks the supported compute capabilities, and the addition of an error when one attempts to use an unsupported NVVM version, along with a warning at import time for numba.cuda that fires if an unsupported version is detected.
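
The warp-synchronization issue in the first bullet can be modelled without a GPU. Below is a minimal pure-Python sketch (names are illustrative, not Numba's actual kernel code) of the tree reduction each warp performs in shared memory, with a comment marking where `syncwarp()` is required on Volta and later:

```python
def warp_reduce(values):
    """Tree-reduce 32 values the way each warp does over its shared-memory slice."""
    sm = list(values)  # stands in for the warp's shared-memory slice
    width = len(sm) // 2
    while width >= 1:
        # Each "thread" tid < width accumulates its partner's value.
        for tid in range(width):
            sm[tid] += sm[tid + width]
        # In the real kernel, cuda.syncwarp() is needed here: on Volta and
        # beyond, writes by other threads in the warp are no longer
        # automatically visible to the rest of the warp.
        width //= 2
    return sm[0]
```

On pre-Volta hardware the warp's lanes execute in lock-step, which is why the barrier between reduction levels could previously be omitted.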

For more detailed comments, see individual commit messages.

This aligns their signedness with CUDA C/C++, and results in slightly
more efficient code using fewer instructions in some cases due to the
reduction in operations relating to signedness. For example, the
reduction kernels become more efficient with this commit, which also
increased their tendency to race (fixed in the previous commit).
The `syncwarp()` function used in the reduction kernels was introduced
with CUDA 9.0. Rather than trying to make the availability of reduction
kernels conditional on the CUDA Toolkit version, it seems more sensible
to just drop CUDA 8.0 now that it is quite old.

This commit also includes fixes for the logic setting `SUPPORTED_CC` in
`nvvm.py` (when there was no suitable toolkit, it was creating a tuple
containing the empty tuple), and adds an error when an attempt is made
to use NVVM when no supported toolkit is found.
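
The `SUPPORTED_CC` bug described above can be illustrated in isolation; a hedged sketch (not Numba's actual `nvvm.py` code):

```python
# With no suitable toolkit, the old logic built a tuple containing the empty
# tuple, which is non-empty and therefore truthy -- so it slipped past any
# "is anything supported?" check.
broken_supported_cc = (tuple(),)  # ((),) -- truthy
fixed_supported_cc = tuple()      # ()   -- falsy

# The fixed value lets a guard fire cleanly when nothing is supported:
error_raised = False
if not fixed_supported_cc:
    error_raised = True  # stands in for raising the new NVVM error
```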
Using a partial mask with `syncwarp()` is only supported on CC 7.0 and
greater architectures. This commit switches the implementation to always
use a full mask, leaving the use of a partial mask on CC 7.0 and above
as a potential future optimization.
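
The mask arithmetic behind that future optimization is straightforward; a small sketch (the helper name is hypothetical):

```python
# Full mask: all 32 lanes of the warp participate -- the only option
# below CC 7.0, and what the no-argument syncwarp() assumes.
FULL_MASK = 0xFFFFFFFF

def participating_mask(width):
    """Bitmask covering warp lanes 0..width-1 (hypothetical helper).

    On CC >= 7.0 this partial mask could be passed to syncwarp() so that
    only the lanes still active at a given reduction step synchronize.
    """
    return (1 << width) - 1
```

For example, `participating_mask(16)` gives `0x0000FFFF`, covering the sixteen lanes still active at the second reduction level.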
@gmarkall gmarkall added the CUDA CUDA related issue/PR label Aug 11, 2020
@gmarkall gmarkall added this to the Numba 0.52 RC milestone Aug 11, 2020
@sklam
Member

sklam commented Aug 11, 2020

When I read the description, my first thought was that it might increase the number of unnecessary casts to float64, because of NumPy's semantics for mixing signed and unsigned 64-bit integers. But it should not be a problem here, because the intrinsics are uint32. Mixing with integer literals in the code will likely produce int64 (e.g. uint32 + int64 -> int64)
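
That promotion behaviour can be checked directly with NumPy's type-promotion rules:

```python
import numpy as np

# uint64 and int64 have no common integer type, so NumPy promotes to
# float64 -- the cast being worried about here.
assert np.result_type(np.uint64, np.int64) == np.float64

# uint32 mixed with int64 stays integral, which is why uint32 indices are safe.
assert np.result_type(np.uint32, np.int64) == np.int64
```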

@gmarkall
Member Author

@sklam is the avoidance of casts to float64 the reason they were originally signed?

@gmarkall
Member Author

So far, comparing PTX before and after this patch (with a subset of the test suite), I haven't seen any instances where typing is casting to floats. However, I suspect that signed / unsigned comparisons are forcing a cast to float, because there are only lowering implementations for signed-vs-signed and unsigned-vs-unsigned comparisons. Looking into this a bit now...

@gmarkall
Member Author

BTW @sklam - could you do a buildfarm run of this please, in case it turns up any other latent issues? This feels like the sort of thing that will surprise me on different OSes, devices, and toolkit versions (though I did endeavour to test with various toolkit versions).

@seibert
Contributor

seibert commented Aug 11, 2020

The cast to float is only going to be a problem for uint64, which is how you are avoiding the problem here. The indices are uint32, which will cast up to int64 when combined with literal integer values. Since the indices are almost always being used to compute pointers, the slightly premature cast to int64 probably doesn't have any performance impact either.

@gmarkall
Member Author

I've broken out the straightforward parts of this PR (everything but changing the signedness of indices) into a separate PR so that they can be reviewed / merged independently: #6127.

@gmarkall
Member Author

In the meantime for this PR, I'm looking into an acceptable way to avoid casting to floats for the signed / unsigned comparison that occurs when comparing a value based on one of these indices with e.g. the dimension of an array (e.g. x.shape[0] and similar).

@gmarkall gmarkall added 4 - Waiting on author Waiting for author to respond to review and removed 3 - Ready for Review labels Aug 12, 2020
@stuartarchibald
Contributor

NOTE: To reviewers, this is dropping CUDA Toolkit 8.0 support.

@gmarkall
Member Author

NOTE: To reviewers, this is dropping CUDA Toolkit 8.0 support.

I split that out into #6127, but I left all that's in #6127 in here because it's needed for this PR to work correctly - so #6127 should be considered first, then this PR.

@stuartarchibald
Contributor

NOTE: To reviewers, this is dropping CUDA Toolkit 8.0 support.

I split that out into #6127, but I left all that's in #6127 in here because it's needed for this PR to work correctly - so #6127 should be considered first, then this PR.

Totally missed that from reading PRs top down, thanks!

@gmarkall
Member Author

Following discussion in the triage meeting, I'll look into replacing the unsigned types for these indices with an index type to see if that alleviates some of the complication of typing and casts to float.

@gmarkall
Member Author

gmarkall commented Jun 8, 2022

gpuci run tests

@gmarkall
Member Author

gmarkall commented Jun 8, 2022

gpuci run tests

@gmarkall
Member Author

gmarkall commented Jun 8, 2022

The issue with the current revision of this branch can be seen by running:

NUMBA_DUMP_ASSEMBLY=1 python runtests.py numba.cuda.tests.cudapy.test_compiler.TestCompileToPTX.test_global_kernel

where the PTX shows the signed / unsigned comparison being lowered through float64:

	mov.u32 	%r2, %tid.x;
	mov.u32 	%r3, %ctaid.x;
	mov.u32 	%r4, %ntid.x;
	mad.lo.s32 	%r1, %r4, %r3, %r2;
	cvt.rn.f64.u32 	%fd1, %r1;
	cvt.rn.f64.s64 	%fd2, %rd7;
	setp.geu.f64 	%p1, %fd1, %fd2;
	@%p1 bra 	$L__BB0_2;

@gmarkall
Member Author

The latest commit seems to help with avoiding float64 generation. Plans for forward progress are:

  • Dump the type annotations for the entire test suite with and without this change, and compare them
  • Compare register usage with and without this change. Dumping register usage for easy comparison entails:
diff --git a/numba/cuda/codegen.py b/numba/cuda/codegen.py
index 47c6d36a7..6b75fc49c 100644
--- a/numba/cuda/codegen.py
+++ b/numba/cuda/codegen.py
@@ -202,6 +202,7 @@ class CUDACodeLibrary(serialize.ReduceMixin, CodeLibrary):
 
         # Load
         cufunc = module.get_function(self._entry_name)
+        print(f"\nATTRIBUTES : {self._entry_name} : {cufunc.attrs}\n")
 
         # Populate caches
         self._cufunc_cache[device.id] = cufunc

@gmarkall
Member Author

Summary of a little more analysis:

  • Mostly this has no effect on the register count, or makes it go down.

  • There are some functions for which the register count goes up by 1 or 2, but the average delta across the entire test suite is -0.67. Some tests have deltas of -6 to -8, which seems quite an improvement (e.g. going from 24 down to 16).

  • There is one pathological case for register usage, where register count increases from 25 to 35 with the current state of this branch. The actual PTX here is shorter and simpler (because of reduced sign extension), but the PTX -> SASS compilation seems to have chosen a completely different code generation strategy that requires more registers but far fewer instructions, so this is probably not a regression. Furthermore, the PTX requires far fewer registers with this PR. From before:

    .reg .pred      %p<35>;
    .reg .b16       %rs<35>;
    .reg .b32       %r<8>;
    .reg .b64       %rd<67>;

    and after:

    .reg .pred      %p<20>;
    .reg .b16       %rs<35>;
    .reg .b32       %r<5>;
    .reg .b64       %rd<13>;
  • The typing changes introduce no new casts to floats (hooray!). However, there are occasional points where 32-bit values end up being 64-bit - e.g.:

    -        #   $30call_function.2 = call $26load_global.0(ntid, func=$26load_global.0, args=[Var(ntid, test_exception.py:66)], kws=(), vararg=None, varkwarg=None, target=None)  :: (int32,) -> range_state_int32
    +        #   $30call_function.2 = call $26load_global.0(ntid, func=$26load_global.0, args=[Var(ntid, test_exception.py:66)], kws=(), vararg=None, varkwarg=None, target=None)  :: (int64,) -> range_state_int64
    -    #   val = call $14load_method.5($const16.6, i, delta, func=$14load_method.5, args=[Var($const16.6, test_warp_ops.py:29), Var(i, test_warp_ops.py:28), Var(delta, test_warp_ops.py:28)], kws=(), vararg=None, varkwarg=None, target=None)  :: (Literal[int](4294967295), int32, int32) -> int32
    +    #   val = call $14load_method.5($const16.6, i, delta, func=$14load_method.5, args=[Var($const16.6, test_warp_ops.py:29), Var(i, test_warp_ops.py:28), Var(delta, test_warp_ops.py:28)], kws=(), vararg=None, varkwarg=None, target=None)  :: (Literal[int](4294967295), thread_idx, int32) -> int64

    I wonder if this contributes to some of the register usage increases.

Some scripts / notes for analysis are also added:

  • A script to make it easy to compare type annotations (commands.sh) when running the suite on the main branch and this branch with NUMBA_DUMP_ANNOTATION=1.
  • A script to read the dumps of the function attributes and bring them all into one data frame for comparison
  • An Excel sheet with the data dumped from the aforementioned script, which computes the delta, highlights the cases where there's an increase, and computes some summary data.
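
A hedged sketch of the attribute-comparison step (the dump format and helper names here are assumptions for illustration, not the script actually used):

```python
import re

# Matches "ATTRIBUTES : <kernel> : {...}" lines like those printed by the
# patch above, pulling out the kernel name and its register count.
PATTERN = re.compile(r"ATTRIBUTES : (\S+) : .*'regs': (\d+)")

def parse_dump(text):
    """Map kernel name -> register count from a test-suite log."""
    return {m.group(1): int(m.group(2)) for m in PATTERN.finditer(text)}

def register_deltas(before_text, after_text):
    """Per-kernel register delta (after - before) for kernels in both logs."""
    before, after = parse_dump(before_text), parse_dump(after_text)
    return {name: after[name] - regs
            for name, regs in before.items() if name in after}
```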

It would be nice to avoid any unwanted typing changes at all, so I intend to try and refine this a bit further still.

@github-actions

This pull request is marked as stale as it has had no activity in the past 3 months. Please respond to this comment if you're still interested in working on this. Many thanks!

@github-actions github-actions bot added the stale Marker label for stale issues. label Mar 30, 2023
@gmarkall
Member Author

I still plan to work on this.

@github-actions github-actions bot removed the stale Marker label for stale issues. label Mar 31, 2023
@github-actions

This pull request is marked as stale as it has had no activity in the past 3 months. Please respond to this comment if you're still interested in working on this. Many thanks!

@github-actions github-actions bot added the stale Marker label for stale issues. label Jun 29, 2023
@gmarkall
Member Author

I'm still going to get round to this one day...

@github-actions github-actions bot removed the stale Marker label for stale issues. label Jul 1, 2023
@github-actions

This pull request is marked as stale as it has had no activity in the past 3 months. Please respond to this comment if you're still interested in working on this. Many thanks!

@github-actions github-actions bot added the stale Marker label for stale issues. label Sep 30, 2023
@gmarkall
Member Author

gmarkall commented Oct 2, 2023

I am still interested in finishing this off, when everything aligns to give me a chance of fixing it.

@gmarkall gmarkall removed the stale Marker label for stale issues. label Oct 2, 2023

github-actions bot commented Jan 1, 2024

This pull request is marked as stale as it has had no activity in the past 3 months. Please respond to this comment if you're still interested in working on this. Many thanks!

@github-actions github-actions bot added the stale Marker label for stale issues. label Jan 1, 2024
@gmarkall
Member Author

gmarkall commented Jan 2, 2024

Still on the to-do list for one day...

@github-actions github-actions bot removed the stale Marker label for stale issues. label Jan 3, 2024

github-actions bot commented Apr 3, 2024

This pull request is marked as stale as it has had no activity in the past 3 months. Please respond to this comment if you're still interested in working on this. Many thanks!

@github-actions github-actions bot added the stale Marker label for stale issues. label Apr 3, 2024
@gmarkall
Member Author

gmarkall commented Apr 3, 2024

One day...

@gmarkall gmarkall removed the stale Marker label for stale issues. label Apr 3, 2024
Labels
4 - Waiting on author Waiting for author to respond to review CUDA CUDA related issue/PR

4 participants