
Kernel awareness of hardware thread index #75

Closed
BenjaminPelletier opened this issue Mar 27, 2020 · 3 comments

@BenjaminPelletier

Each Accelerator has a MaxNumThreads property giving the number of threads that can run simultaneously. Can a kernel invocation determine which of these threads it is running on, perhaps by reading an integer between 0 and MaxNumThreads - 1?

One application of this is when a GPU algorithm requires indexable working memory. For instance, there is a wide class of algorithms that are typically implemented using recursion. Since GPU methods may not recurse, these algorithms can be rewritten as flat loops, but they then generally need a "stack" array in working memory. Because ILGPU does not support even fixed-size arrays as local variables in kernel methods (support for that would be better than the feature this issue requests), this array must be passed into the kernel as an ArrayView.

But without a way to determine which hardware thread an invocation is using, the provided ArrayView must be sized for all elements to be processed, even though only MaxNumThreads slices of it are in use at any given time. For instance, with 10e6 elements to process and a maximum stack depth of M, the "stack" ArrayView would need 10e6 * M items, even though they are just working memory and no more than MaxNumThreads (on the order of 1e3) * M items would ever be in use at once. If a kernel invocation could determine which hardware thread it was running on, it could index into a stack array of only MaxNumThreads * M items rather than 10e6 * M.
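The pattern described above can be sketched as follows. This is illustrative only: the kernel signature, the `maxDepth` parameter, and the per-element stack slicing are assumptions for the sake of the example, not ILGPU's actual API.

```csharp
// Sketch: recursion rewritten as a flat loop over an explicit stack.
// With only a per-element index available, each element needs its own
// maxDepth-sized slice of the shared stack buffer, so the buffer must
// hold data.Length * maxDepth entries -- the problem described above.
static void IterativeKernel(
    Index index,               // per-element index (ILGPU-style)
    ArrayView<int> data,
    ArrayView<int> stack,      // sized data.Length * maxDepth
    int maxDepth)
{
    int stackBase = index * maxDepth;  // this invocation's slice
    int stackTop = 0;

    // Push the initial "call".
    stack[stackBase + stackTop++] = data[index];

    // Flat loop replacing recursion: pop an item, process it,
    // and push any follow-up work (bounded by maxDepth).
    while (stackTop > 0)
    {
        int item = stack[stackBase + --stackTop];
        // ... process item; e.g.:
        // if (NeedsMoreWork(item) && stackTop < maxDepth)
        //     stack[stackBase + stackTop++] = NextItem(item);
    }
}
```

A hardware-thread index would let `stackBase` be derived from the thread rather than the element, shrinking the buffer to MaxNumThreads * maxDepth.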

@m4rs-mt (Owner) commented Mar 30, 2020

@BenjaminPelletier thank you for your question. It sounds like you want to calculate the grid stride of your kernel (in other words, Group.DimX * Grid.DimX). This corresponds to the number of threads started in the current kernel execution environment. Adding support for accessing the general maximum number of threads on an accelerator is a lot of work and can easily lead to incorrect memory accesses, since the actual launch size of a kernel is not known beforehand.

Please note that the next version will include 1D arrays in local memory 🔢
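The grid-stride idea above can be sketched like this. The `Grid`/`Group` property names are assumed from the comment and may differ between ILGPU versions, and `GetSubView` is used on the assumption that ArrayView supports sub-slicing:

```csharp
// Sketch of a grid-stride loop: the stack buffer only needs one
// maxDepth-sized slice per *launched* thread (stride * maxDepth
// entries), not one per element, because each thread reuses its
// slice while striding over the data.
static void GridStrideKernel(
    ArrayView<int> data,
    ArrayView<int> stack,      // sized stride * maxDepth
    int maxDepth)
{
    int stride = Group.DimX * Grid.DimX;            // threads launched
    int tid = Grid.IdxX * Group.DimX + Group.IdxX;  // flat id in [0, stride)

    // This thread's private working-memory slice.
    var myStack = stack.GetSubView(tid * maxDepth, maxDepth);

    for (int i = tid; i < data.Length; i += stride)
    {
        // ... process data[i] using myStack as the explicit stack
    }
}
```

The key point is that `stride` is known at launch time (it is chosen by the code launching the kernel), so the caller can allocate a matching stack buffer without any notion of the accelerator's hardware thread count.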

m4rs-mt self-assigned this Mar 30, 2020
@BenjaminPelletier (Author)

With 1D arrays in local memory, this question becomes obsolete for my use case so maybe it's not worth spending much time on.

But, in case it is worthwhile: if 1) no two simultaneously running threads ever share the same (Group.X, Grid.X) pair, and 2) Group.DimX * Grid.DimX <= Accelerator.MaxNumThreads, then it seems this question would be answered if I could read Group.X and Grid.X inside the kernel. I'm not sure how to do that, though. I'm currently using an Index3; I don't see the GroupedIndex* classes mentioned in the documentation, and I don't see a way to retrieve a more advanced index from my Index3.

@MoFtZ (Collaborator) commented Mar 30, 2020

hi @BenjaminPelletier, you are most likely using accelerator.LoadAutoGroupedStreamKernel to start your kernel.

ILGPU provides a simplified API to launch kernels without having to worry about grouping, which definitely helped me when I started GPGPU programming. The CUDA tutorials, by contrast, all require you to specify the grouping as part of launching the kernel.

If you want to take control of the grouping yourself, there are other LoadXXXKernel methods to explore - see Kernel Loading in the documentation.

You should probably also check out the ILGPU samples, which helped me get my head around the various API calls.
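As a hedged sketch of what an explicitly grouped launch might look like: the method and property names below (LoadStreamKernel, MaxNumThreadsPerGroup, the tuple-style kernel configuration) are approximations, so check the Kernel Loading documentation and the ILGPU samples for the exact API in your version.

```csharp
// Sketch: explicitly grouped launch, so the host code chooses the
// group size and group count and can size working memory to match.
using var context = new Context();
using var accelerator = new CudaAccelerator(context);

var kernel = accelerator
    .LoadStreamKernel<ArrayView<int>, ArrayView<int>, int>(GridStrideKernel);

int groupSize = accelerator.MaxNumThreadsPerGroup;
int numGroups = 64;        // illustrative choice
int maxDepth = 32;         // illustrative stack depth

using var data = accelerator.Allocate<int>(1_000_000);
// Stack buffer sized per launched thread, not per element.
using var stack = accelerator.Allocate<int>(numGroups * groupSize * maxDepth);

// Launch with an explicit (numGroups, groupSize) configuration.
kernel((numGroups, groupSize), data.View, stack.View, maxDepth);
accelerator.Synchronize();
```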
