Getting large chunks of memory for a thread team in a universal way #664

Closed
bathmatt opened this issue Mar 2, 2017 · 17 comments
Labels: Bug (Broken / incorrect code; it could be Kokkos' responsibility, or others', e.g., Trilinos)

bathmatt commented Mar 2, 2017

I have an algorithm with thread teams where each team requires about 10 MB of local data, and we have about 1000 teams. I don't want to allocate all the views up front, since once a team is done I can throw that data away.

What is the best way to robustly get that thread local data?

Also, is there an example of how to do this?
Thanks
Matt

dholladay00 commented
If I am reading that right, I believe I have a similar issue: each team needs its own data.

This is how I do it:

const int rows  = 10;
const int level = 1; // I think level 1 is the larger of the memory spaces
size_t mem_size = 0;
// add the size of each team-local view
mem_size += Kokkos::View<double**>::shmem_size(rows, rows);
Kokkos::parallel_for(policy.set_scratch_size(level, Kokkos::PerTeam(mem_size)),
  KOKKOS_LAMBDA(const member_type& team)
{
  // team-local data
  Kokkos::View<double**> B(team.team_scratch(level), rows, rows);
});

Apologies if this was not your question.

bathmatt (Author) commented Mar 2, 2017

The data doesn't fit into shared memory on a GPU, and it isn't the same size between teams: some need 20 MB, some need 1 MB. AFAIK this wouldn't work for my use case.

dholladay00 commented
The memory level, I believe, is supposed to specify a memory space. While I don't think it is currently implemented, my understanding is that eventually setting level = 1 will allocate in global memory.

bathmatt (Author) commented Mar 3, 2017

OK, thanks, didn't know about the level arg.

dholladay00 commented
Since I'm just a Kokkos user, it would be reassuring to have a real Kokkos developer weigh in and make sure I'm not spouting nonsense. I was told about this by @crtrott, so I'll tag him to either shoot down or verify my claims.

crtrott (Member) commented Mar 3, 2017

We had a bug with the level 1 path; let me check whether that is fixed. But yes, this is exactly how it is supposed to work.

crtrott (Member) commented Mar 3, 2017

Actually, I think it's not yet fixed, but I have a reasonable idea of how to do it.

crtrott (Member) commented Mar 3, 2017

It does work on CPUs right now, though.

bathmatt (Author) commented Mar 3, 2017 via email

crtrott added the Bug label Mar 3, 2017
crtrott (Member) commented Mar 3, 2017

Now there is :-). We just found this last week, but we were all busy with upcoming travel and forgot to file the issue.

crtrott (Member) commented Mar 5, 2017

So my preferred way of solving this seems to cause a bit of overhead for every use of hierarchical parallelism. Basically, I thought about using a (semi) fixed number of blocks so that I have unique identifiers, but that effectively circumvents the GPU's dynamic scheduling. The best I could come up with costs about 3-5% in performance in the bytes_and_flops benchmark; many schemes cost significantly more than that, at least for a number of configurations of the benchmark. I'll try the same strategy as with the random-number pool next. That at least limits the overhead to the cases where you actually use the scratch.
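The pool strategy mentioned above can be pictured as a fixed set of scratch slabs that teams lock on demand and release when done. Below is a minimal standalone C++ sketch of that idea; the ScratchPool class and its spin-lock acquire are an illustration of the concept only, not Kokkos' actual implementation:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Conceptual sketch of a scratch-slab pool: a fixed number of slabs,
// each guarded by an atomic flag. A "team" that needs scratch spins
// until it wins a free slab, so only code paths that actually request
// scratch pay any locking cost.
class ScratchPool {
public:
  ScratchPool(std::size_t num_slabs, std::size_t slab_bytes)
      : locks_(num_slabs),
        storage_(num_slabs * slab_bytes),
        slab_bytes_(slab_bytes) {
    for (auto& l : locks_) l.store(0);
  }

  // Spin until some slab is free; return its index.
  std::size_t acquire() {
    for (;;) {
      for (std::size_t i = 0; i < locks_.size(); ++i) {
        int expected = 0;
        // Atomically flip 0 -> 1; success means we own slab i.
        if (locks_[i].compare_exchange_strong(expected, 1)) return i;
      }
    }
  }

  // Mark slab i free again.
  void release(std::size_t i) { locks_[i].store(0); }

  // Raw storage backing slab i.
  char* slab(std::size_t i) { return storage_.data() + i * slab_bytes_; }

private:
  std::vector<std::atomic<int>> locks_;  // 0 = free, 1 = in use
  std::vector<char> storage_;            // num_slabs contiguous slabs
  std::size_t slab_bytes_;
};
```

A team would call acquire() before constructing its scratch views and release() afterwards; kernels that never request scratch never touch the lock array, which is what limits the overhead to actual scratch users.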

crtrott (Member) commented Mar 7, 2017

I'll go the second way for now. I need to clean up the mess I left this in ...

@crtrott crtrott self-assigned this Mar 8, 2017
@crtrott crtrott added this to the 2017-April-end milestone Mar 8, 2017
crtrott (Member) commented Mar 8, 2017

OK, I fixed this and also added a unit test to catch it in the future.

crtrott added a commit that referenced this issue Mar 8, 2017:

    This was never correctly implemented. Every team ended up with the
    same scratch space in level 1. Level 0 was fine.
    Addresses issue #664

crtrott added a commit that referenced this issue Mar 8, 2017
bathmatt (Author) commented Mar 10, 2017

I'm having a hard time getting this to work; just to make sure I'm doing it correctly:

In my calling routine, I first calculate the largest memory size I will use:

size_t mem_size = Kokkos::View<double**>::shmem_size(max_size,max_size);

then I set the policy

policy.set_scratch_size(1, Kokkos::PerTeam(mem_size), Kokkos::PerThread(0)) ;

then in the functor I call

Kokkos::View<double**> AhA(team_member.team_scratch(1), cnt, cnt);

In this example cnt is the same as max_size above; both are 101.

But this returns null for the data pointer:

(gdb) p AhA
$13 = {<Kokkos::ViewTraits<double**>> = {}, m_track = {m_record_bits = 1, m_record = 0x1}, m_map = {m_handle = 0x0, m_offset = {
m_dim = {<Kokkos::Experimental::Impl::ViewDimension0<0ul, 2u>> = {N0 = 101}, <Kokkos::Experimental::Impl::ViewDimension1<0ul, 2u>> = {
N1 = 101}, <Kokkos::Experimental::Impl::ViewDimension2<18446744073709551615ul, 2u>> = {}, <Kokkos::Experimental::Impl::ViewDimension3<18446744073709551615ul, 2u>> = {}, <Kokkos::Experimental::Impl::ViewDimension4<18446744073709551615ul, 2u>> = {}, <Kokkos::Experimental::Impl::ViewDimension5<18446744073709551615ul, 2u>> = {}, <Kokkos::Experimental::Impl::ViewDimension6<18446744073709551615ul, 2u>> = {}, <Kokkos::Experimental::Impl::ViewDimension7<18446744073709551615ul, 2u>> = {}, }, m_stride = 101}}}

Where is my mistake?
Thanks
Matt

dholladay00 commented Mar 10, 2017

If I recall correctly, set_scratch_size needs to be called within the parallel pattern call: parallel_for(policy.set_scratch_size…

See #195

bathmatt (Author) commented

Thanks. Now, how does this work with shared memory? I assume that is level 0 on a GPU, or is that different? I looked, and there is no complex example that I could find.

Also, what does level mean on CPUs?

crtrott (Member) commented Mar 10, 2017

Hi,

so this is a confusing area, and there was a long discussion about how this should work. Basically, set_scratch_size returns a new policy object; it doesn't change the existing one, and that's why you are in trouble. The issue is that we are looking at what C++ wants to do with optional execution-policy arguments, and there is no consensus yet. Anyway, we are probably going to change this to make those things constructor arguments; see issue #453. Also, quite near the end of the current tutorial there is a bit about scratch space. Basically, you can provide numbers for both level 0 and level 1. On CPUs the two levels just collapse; if you provide numbers for both, the total size is the sum of the two.

Christian
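The pitfall described in this thread, that set_scratch_size returns a modified copy rather than mutating the receiver, can be sketched with a hypothetical value-semantics Policy class (a stand-in for illustration only, not the real Kokkos::TeamPolicy):

```cpp
#include <cstddef>

// Hypothetical stand-in for an execution policy, illustrating why
// calling set_scratch_size() as a standalone statement has no effect:
// the method is const and returns a modified *copy* of the policy.
struct Policy {
  std::size_t scratch_bytes = 0;

  Policy set_scratch_size(std::size_t bytes) const {
    Policy p = *this;        // copy; the original is left untouched
    p.scratch_bytes = bytes; // modify only the copy
    return p;
  }
};
```

This is why a statement like `policy.set_scratch_size(1, PerTeam(mem_size));` on its own does nothing: the returned policy is discarded. Passing the result directly into the parallel pattern, as in `parallel_for(policy.set_scratch_size(...), ...)`, hands the modified copy to the dispatch.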
