Getting large chunks of memory for a thread team in a universal way #664
Comments
If I am reading that right, I believe I have a similar issue: each team needs its own data. This is how I do it:

```cpp
const int rows = 10;
const int level = 1;  // I think level 1 is the larger of the memory spaces

// Add up the scratch bytes for each team-local view.
size_t mem_size = 0;
mem_size += View<double**>::shmem_size(rows, rows);

parallel_for(policy.set_scratch_size(level, PerTeam(mem_size)),
  KOKKOS_LAMBDA(const member_type& team) {
    // Team-local data, allocated from the team's scratch pad.
    View<double**> B(team.team_scratch(level), rows, rows);
  });
```

Apologies if this was not your question. |
The data doesn't fit into shared memory on a GPU and isn't the same size between teams: some need 20M, some need 1M. AFAIK this wouldn't work for my use case. |
The memory |
OK, thanks, didn't know about the level arg. |
Since I'm just a Kokkos user, it would be reassuring to have a real Kokkos developer weigh in and make sure I'm not spouting nonsense. I was told about this by @crtrott, so I'll tag him to see if he can either shoot down or verify my claims. |
We had a bug with the level 1 thing; let me check whether that is fixed. But yes, this is exactly how this is supposed to work. |
Actually, I think it's not yet fixed, but I have a reasonable idea of how to do this. |
It does work on CPUs right now though. |
Is there a ticket on this? Can I get added to that bug report so I can keep track? |
Now there is :-). We just found this last week, but we were all busy with upcoming travel, so we forgot to add the issue. |
So my preferred way of solving this seems to cause a bit of overhead for every use of hierarchical parallelism. Basically I thought about using just a (semi-)fixed number of blocks, so that I have unique identifiers. But that effectively circumvents the dynamic scheduling of the GPU. The best I could come up with costs about 3-5% in performance in the bytes_and_flops benchmark; many schemes cost significantly more than that, at least for a number of configurations of the benchmark. I'll try the same strategy as with the random number pool next. That at least limits the overhead to the cases where you actually use the scratch. |
I'll go the second way for now. I need to clean up the mess I left this in ... |
Ok I fixed this and also added a unit test to catch this in the future. |
This was never correctly implemented: every team ended up with the same scratch space in level 1. Level 0 was fine. Addresses issue #664
I'm having a hard time getting this to work. Just to make sure I'm doing this correctly: in my calling routine, I calculate the largest memory size I will use as follows
then I set the policy
then in the functor I call
In this example cnt is the same as max_size above (both are 101), but this returns null for the data pointer (`(gdb) p AhA`). Where is my mistake? |
If I recall correctly, see #195. |
Thanks. Now, how does this work with shared memory? I assume that is level 0 on a GPU, or is that different? I looked and there is no complete example that I could find. Also, what does level mean on CPUs? |
Hi, so this is a thing which is confusing, and there was a long discussion about how this should work. Basically, set_scratch_size returns a new policy object; it doesn't change the existing one. That's why you are in trouble. The issue is that we are looking at what C++ wants to do with optional execution policy arguments, and there is no consensus yet. Anyway, we are probably going to change this to make those things constructor arguments; see issue #453. Also, quite near the end of the current tutorial there is a bit about scratch space. Basically, you can provide numbers for both level 0 and level 1. On CPUs these two levels just collapse; if you provide numbers for both, the total size is just the sum of the two. Christian |
I have an algorithm with thread teams where each team requires about 10M of local data, and we have about 1000 teams. I don't want to allocate all the views up front, since once a team is done I can throw all that data away.
What is the best way to robustly get that thread local data?
Also, is there an example of how to do this?
Thanks
Matt