Kernel awareness of hardware thread index #75
Comments
@BenjaminPelletier thank you for your question. It sounds like you want to calculate the grid stride of your kernel. Please note that the next version will include 1D arrays in local memory 🔢
With 1D arrays in local memory, this question becomes obsolete for my use case, so maybe it's not worth spending much time on. But, in case it is worthwhile: if it's the case that 1) no two threads running simultaneously will ever have the same (Group.X, Grid.X) pair and 2) Group.DimX * Grid.DimX <= Accelerator.MaxNumThreads, then it seems like this question would be answered if I could see Group.X and Grid.X in the kernel. I'm not sure how to do that, though. Currently, I'm using an Index3; I don't see the GroupedIndex* classes that are mentioned in the documentation, and I don't see a way to retrieve a more advanced index from my Index3.
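For illustration, assuming the two conditions above hold, the pair (Group.X, Grid.X) flattens into a unique linear index in [0, Group.DimX * Grid.DimX). A language-agnostic sketch of that arithmetic in Python (the names mirror the properties discussed above; this is not ILGPU API):

```python
# Sketch: flattening (Group.X, Grid.X) into a unique linear thread index.
# group_dim_x plays the role of Group.DimX, grid_dim_x of Grid.DimX.

def linear_thread_index(group_x: int, grid_x: int, group_dim_x: int) -> int:
    """Unique index in [0, group_dim_x * grid_dim_x) for each (group_x, grid_x) pair."""
    return grid_x * group_dim_x + group_x

# Under assumption (1) -- no two simultaneous threads share a (Group.X, Grid.X)
# pair -- this mapping is a bijection onto the range, so indices never collide.
group_dim_x, grid_dim_x = 4, 3
indices = {linear_thread_index(gx, bx, group_dim_x)
           for bx in range(grid_dim_x) for gx in range(group_dim_x)}
assert indices == set(range(group_dim_x * grid_dim_x))
```

Combined with assumption (2), this index is always below Accelerator.MaxNumThreads, so it could serve directly as the "hardware thread index" this issue asks for.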
hi @BenjaminPelletier, ILGPU provides a simplified API to launch kernels without having to worry about grouping, and that simplified API is most likely what you are using. It definitely helped me when I started GPGPU programming. If you look at the CUDA tutorials, for example, they all require you to specify the grouping as part of launching the kernel. If you want to take control of the grouping yourself, there are other options available. You should probably also check out the ILGPU samples, which helped me get my head around the various API calls.
Each Accelerator has MaxNumThreads threads that can run simultaneously. Can a kernel invocation determine which of these threads it is running on, perhaps by accessing an integer between 0 and MaxNumThreads - 1?
One application of this is when a GPU algorithm requires indexable working memory. For instance, there is a wide class of algorithms that are typically implemented using recursion. Since GPU methods may not recurse, these algorithms can be rewritten as flat loops, but in that case they generally need a "stack" array in working memory. Because ILGPU does not support even fixed-size arrays as local variables in kernel methods (support for that would be better than the feature this issue requests), this array must be provided to the kernel as an ArrayView.

But without a way to determine which hardware thread an invocation is using, this provided ArrayView must be sized for the full set of elements to be processed, even though only MaxNumThreads elements' worth of the ArrayView are in use at any given time. So, for instance, if there were 10e6 elements to be processed, the "stack" ArrayView would need 10e6 * M items (where M is the maximum stack depth), even though they are just working memory and no more than MaxNumThreads (order of 1e3) * M items would ever be in use at a time. If a kernel invocation could determine which hardware thread it was running on, it could index into a stack ArrayView of merely MaxNumThreads * M items rather than 10e6 * M items.
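As a concrete sketch of the working-memory scheme described above: here is a recursive tree traversal rewritten as a flat loop with an explicit stack, where each simulated thread indexes into a fixed slice of one shared buffer sized max_threads * max_depth rather than num_elements * max_depth. This is CPU-side Python purely to illustrate the indexing; all names are illustrative, not ILGPU API:

```python
# Sketch: flattening recursion into a loop with an explicit per-thread stack slice.
# A shared flat buffer of size max_threads * max_depth stands in for the ArrayView;
# "thread" t uses only stack[t * max_depth : (t + 1) * max_depth].

def iterative_sum(tree, root, thread_id, stack, max_depth):
    """Sum node values of a binary tree without recursion.

    tree: dict mapping node id -> (value, left_child_or_None, right_child_or_None)
    stack: flat working buffer shared by all threads (len = max_threads * max_depth)
    """
    base = thread_id * max_depth       # start of this thread's private stack slice
    top = 0                            # stack pointer within the slice
    stack[base + top] = root
    top += 1
    total = 0
    while top > 0:                     # flat loop replacing the recursion
        top -= 1
        node = stack[base + top]       # pop
        value, left, right = tree[node]
        total += value
        for child in (left, right):
            if child is not None:
                stack[base + top] = child   # push
                top += 1
    return total

max_threads, max_depth = 2, 8
stack = [0] * (max_threads * max_depth)    # MaxNumThreads * M, not num_elements * M
tree = {0: (1, 1, 2), 1: (2, None, None), 2: (3, None, None)}
print(iterative_sum(tree, 0, thread_id=1, stack=stack, max_depth=max_depth))  # -> 6
```

The key point is the sizing: the buffer grows with the number of concurrently running threads, not with the number of elements processed, which is exactly what a kernel-visible hardware thread index would enable.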