Getting large chunks of memory for a thread team in a universal way #664

Closed
bathmatt opened this issue Mar 2, 2017 · 17 comments
Labels: Bug (Broken / incorrect code; it could be Kokkos' responsibility, or others', e.g., Trilinos)

bathmatt commented Mar 2, 2017

I have an algorithm with thread teams where each team requires about 10 MB of local data, and we have about 1000 teams. I don't want to allocate all the views up front, since once a team is done I can throw that data away.

What is the best way to robustly get that thread local data?

Also, is there an example of how to do this?
Thanks
Matt

dholladay00 commented
If I am reading that right, I believe I have a similar issue: each team needs its own data.

This is how I do it:

const int rows  = 10;
const int level = 1; // I think level 1 is the larger of the memory spaces
size_t mem_size = 0;
// add the size of each team-local view
mem_size += Kokkos::View<double**>::shmem_size(rows, rows);
Kokkos::parallel_for(policy.set_scratch_size(level, Kokkos::PerTeam(mem_size)),
  KOKKOS_LAMBDA(const member_type& team)
{
  // team-local data
  Kokkos::View<double**> B(team.team_scratch(level), rows, rows);
});

Apologies if this was not your question.

bathmatt (Author) commented Mar 2, 2017

The data doesn't fit into shared memory on a GPU, and it isn't the same size between teams: some need 20 MB, some need 1 MB. AFAIK this wouldn't work for my use case.

dholladay00 commented
The memory level, I believe, is supposed to specify a memory space. While I don't think it is currently implemented, my understanding is that eventually setting level = 1 will allocate in global memory.

bathmatt (Author) commented Mar 3, 2017

OK, thanks, didn't know about the level arg.

dholladay00 commented
Since I'm just a Kokkos user, it would be reassuring to have a real Kokkos developer weigh in and make sure I'm not spouting nonsense. I was told about this by @crtrott, so I'll tag him to either shoot down or verify my claims.

crtrott (Member) commented Mar 3, 2017

We had a bug with the level 1 path; let me check whether that is fixed. But yes, this is exactly how it is supposed to work.

crtrott (Member) commented Mar 3, 2017

Actually, I think it's not yet fixed, but I have a reasonable idea of how to do it.

crtrott (Member) commented Mar 3, 2017

It does work on CPUs right now, though.

bathmatt (Author) commented Mar 3, 2017 via email

crtrott added the Bug label Mar 3, 2017
crtrott (Member) commented Mar 3, 2017

Now there is :-). We just found this last week, but we were all busy with upcoming travel and forgot to file the issue.

crtrott (Member) commented Mar 5, 2017

So my preferred way of solving this seems to cause a bit of overhead for every use of hierarchical parallelism. Basically, I thought about using a (semi) fixed number of blocks so that I have unique identifiers, but that effectively circumvents the GPU's dynamic scheduling. The best I could come up with costs about 3-5% in performance in the bytes_and_flops benchmark; many schemes cost significantly more than that, at least for a number of configurations of the benchmark. I'll try the same strategy as with the random-number pool next. That at least limits the overhead to the cases where you actually use the scratch.
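The pool strategy mentioned above can be pictured as a fixed set of scratch slabs that teams lock on demand and release when done. Below is a minimal standalone C++ sketch of that idea; the ScratchPool class and its spin-lock acquire are an illustration of the concept only, not Kokkos' actual implementation:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Conceptual sketch of a scratch-slab pool: a fixed number of slabs,
// each guarded by an atomic flag. A "team" that needs scratch spins
// until it wins a free slab, so only code paths that actually request
// scratch pay any locking cost.
class ScratchPool {
public:
  ScratchPool(std::size_t num_slabs, std::size_t slab_bytes)
      : locks_(num_slabs),
        storage_(num_slabs * slab_bytes),
        slab_bytes_(slab_bytes) {
    for (auto& l : locks_) l.store(0);
  }

  // Spin until some slab is free; return its index.
  std::size_t acquire() {
    for (;;) {
      for (std::size_t i = 0; i < locks_.size(); ++i) {
        int expected = 0;
        // Atomically flip 0 -> 1; success means we own slab i.
        if (locks_[i].compare_exchange_strong(expected, 1)) return i;
      }
    }
  }

  // Mark slab i free again.
  void release(std::size_t i) { locks_[i].store(0); }

  // Raw storage backing slab i.
  char* slab(std::size_t i) { return storage_.data() + i * slab_bytes_; }

private:
  std::vector<std::atomic<int>> locks_;  // 0 = free, 1 = in use
  std::vector<char> storage_;            // num_slabs contiguous slabs
  std::size_t slab_bytes_;
};
```

A team would call acquire() before constructing its scratch views and release() afterwards; kernels that never request scratch never touch the lock array, which is what limits the overhead to actual scratch users.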

crtrott (Member) commented Mar 7, 2017

I'll go the second way for now. I need to clean up the mess I left this in ...

@crtrott crtrott self-assigned this Mar 8, 2017
@crtrott crtrott added this to the 2017-April-end milestone Mar 8, 2017
crtrott (Member) commented Mar 8, 2017

OK, I fixed this and also added a unit test to catch it in the future.

crtrott added a commit that referenced this issue Mar 8, 2017:

    This was never correctly implemented. Every team ended up with the
    same scratch space in level 1. Level 0 was fine.
    Addresses issue #664

crtrott added a commit that referenced this issue Mar 8, 2017
bathmatt (Author) commented Mar 10, 2017

I'm having a hard time getting this to work; just to make sure I'm doing it correctly:

In my calling routine, I first calculate the largest memory size I will use:

size_t mem_size = Kokkos::View<double**>::shmem_size(max_size,max_size);

then I set the policy

policy.set_scratch_size(1, Kokkos::PerTeam(mem_size), Kokkos::PerThread(0)) ;

then in the functor I call

Kokkos::View<double**> AhA(team_member.team_scratch(1), cnt, cnt);

In this example cnt is the same as max_size above; both are 101.

But this returns null for the data pointer:

(gdb) p AhA
$13 = {<Kokkos::ViewTraits<double**>> = {}, m_track = {m_record_bits = 1, m_record = 0x1}, m_map = {m_handle = 0x0, m_offset = {
m_dim = {<Kokkos::Experimental::Impl::ViewDimension0<0ul, 2u>> = {N0 = 101}, <Kokkos::Experimental::Impl::ViewDimension1<0ul, 2u>> = {
N1 = 101}, <Kokkos::Experimental::Impl::ViewDimension2<18446744073709551615ul, 2u>> = {}, <Kokkos::Experimental::Impl::ViewDimension3<18446744073709551615ul, 2u>> = {}, <Kokkos::Experimental::Impl::ViewDimension4<18446744073709551615ul, 2u>> = {}, <Kokkos::Experimental::Impl::ViewDimension5<18446744073709551615ul, 2u>> = {}, <Kokkos::Experimental::Impl::ViewDimension6<18446744073709551615ul, 2u>> = {}, <Kokkos::Experimental::Impl::ViewDimension7<18446744073709551615ul, 2u>> = {}, }, m_stride = 101}}}

Where is my mistake?
Thanks
Matt

dholladay00 commented Mar 10, 2017

If I recall correctly, set_scratch_size needs to be called within the parallel pattern call: parallel_for(policy.set_scratch_size…

See #195

bathmatt (Author) commented

Thanks. Now, how does this work with shared memory? I assume that is level 0 on a GPU, or is that different? I looked, and there is no complex example that I could find.

Also, what does level mean on CPUs?

crtrott (Member) commented Mar 10, 2017

Hi,

so this is a confusing area, and there was a long discussion about how this should work. Basically, set_scratch_size returns a new policy object; it doesn't change the existing one, and that's why you are in trouble. The issue is that we are looking at what C++ wants to do with optional execution-policy arguments, and there is no consensus yet. Anyway, we are probably going to change this to make those things constructor arguments; see issue #453. Also, quite near the end of the current tutorial there is a bit about scratch space. Basically, you can provide numbers for both level 0 and level 1. On CPUs the two levels just collapse; if you provide numbers for both, the total size is the sum of the two.

Christian
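The pitfall described in this thread, that set_scratch_size returns a modified copy rather than mutating the receiver, can be sketched with a hypothetical value-semantics Policy class (a stand-in for illustration only, not the real Kokkos::TeamPolicy):

```cpp
#include <cstddef>

// Hypothetical stand-in for an execution policy, illustrating why
// calling set_scratch_size() as a standalone statement has no effect:
// the method is const and returns a modified *copy* of the policy.
struct Policy {
  std::size_t scratch_bytes = 0;

  Policy set_scratch_size(std::size_t bytes) const {
    Policy p = *this;        // copy; the original is left untouched
    p.scratch_bytes = bytes; // modify only the copy
    return p;
  }
};
```

This is why a statement like `policy.set_scratch_size(1, PerTeam(mem_size));` on its own does nothing: the returned policy is discarded. Passing the result directly into the parallel pattern, as in `parallel_for(policy.set_scratch_size(...), ...)`, hands the modified copy to the dispatch.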
