Remove macro expansion and replace uses with FE typing + BE lowering #5465
Conversation
…because of the way module globals are treated as constants.
A simple demo:

```
from numba import cuda
import sys
import numpy as np


@cuda.jit
def k(x):
    i = cuda.grid(1)
    tid = cuda.threadIdx
    x[i] = tid.x


x = np.zeros(32, dtype=np.int32)
k[1, 32](x)
print(x)
```

prints:

```
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]
```

There is no change in the final PTX for this simple example.
This is consistent with Numba master at present; in CUDA C/C++, however, these values are unsigned. Making them unsigned seems to cause issues in reduce kernels, which requires further investigation.
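As a hypothetical illustration of the hazard (not code from this PR): fixed-width unsigned arithmetic wraps around instead of going negative, so a reduction-style guard written with a signed mental model can silently never fail. NumPy's fixed-width unsigned scalars show the same behaviour:

```
import numpy as np

tid = np.uint32(0)
s = np.uint32(16)

# 0 - 16 wraps around to 4294967280 rather than producing -16...
print(tid - s)

# ...so a guard like this is always True for unsigned values and never
# catches the "went below zero" case.
print(tid - s >= np.uint32(0))
```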
The test_set_registers_* tests were a little too strict in what they checked: the max_registers option should limit the number of registers used, but it does not guarantee that as many registers as the limit are used. When linking for some devices, fewer than the maximum are used. To remedy this, the tests now check that the number of registers used is less than or equal to the maximum. An additional test is added to ensure that the maximum register count would otherwise have been exceeded.
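A minimal sketch of the relaxed check, assuming the documented max_registers option and a get_regs_per_thread()-style accessor (how the register count is read varies across Numba versions):

```
from numba import cuda
import numpy as np

LIMIT = 57  # illustrative cap, not necessarily the value used in the tests


@cuda.jit(max_registers=LIMIT)
def kernel(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1


x = np.zeros(32, dtype=np.float64)
kernel[1, 32](x)

# max_registers bounds usage but does not force the compiler to use the
# whole budget, hence <= rather than ==.
assert kernel.get_regs_per_thread() <= LIMIT
```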
Also, it looks like there's no use for the […]. Any thoughts on the removal of […]?
@gmarkall thanks for submitting this, I have marked it as in-progress for now.
@esc Thanks - I'm going to split it in two as Siu suggested, so that the CUDA changes can go in without being blocked on testing the one change to the ROC target. I'll keep this PR for the removal of all macro expansion code, and make a new one with just CUDA.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
A CUDA-only version of this PR is in #5481, which is ready for review; once that is approved and/or merged, I will amend this PR accordingly.
Per recent discussions (in the dev meeting, IIRC) it's neither urgent nor likely that this will get into 0.51, so I've bumped the milestone to 0.52.
Thanks @gmarkall
This is modelled on the implementation in the CUDA target, which should work similarly.
Co-authored-by: stuartarchibald <stuartarchibald@users.noreply.github.com>
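For anyone unfamiliar with the pattern, here is a runnable CPU-target sketch of the general FE-typing + BE-lowering idea using numba.extending.intrinsic (the names are illustrative, not those used in this patch): the frontend resolves a signature and the backend emits IR, with no source rewriting in between, which is what replaces the old macro expansion.

```
from numba import njit
from numba.core import types
from numba.extending import intrinsic


@intrinsic
def always_forty_two(typingctx):
    # FE typing: declare what the operation returns.
    sig = types.int32()

    # BE lowering: emit the IR for the operation when the call is compiled.
    def codegen(context, builder, signature, args):
        return context.get_constant(types.int32, 42)

    return sig, codegen


@njit
def f():
    return always_forty_two()


print(f())  # 42
```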
For 5daa538 the things I'd expect to work ok on ROCm in light of these changes still work, so I think it's ok on that hardware target. The patch itself looks good to me; @sklam, did you want to take another look? If not, this can go through the build farm and be merged if it passes. Thanks again for the large refactoring effort, very pleased to see […]
It will just need to be tested on real ROC hardware in the farm to pick up anything else we missed by eyeballing.
I've run this on real ROC hardware locally; the things I'd expect to work ok are working.
Thanks for working on this and for all the fixes etc., especially for having to guess at behaviour on non-local hardware! Looks good!
BFID: numba_smoketest_cuda_98
Buildfarm passed.
As title. This is a WIP PR, to test on CI and to test the change to the ROC backend (I don't have a device to test on). It will need a bit of tidying up and refactoring before I remove the WIP tag.
Also includes the fix for #5408, for testing convenience.
@stuartarchibald are you able to try this on a ROC device please?