
SharedMemory Support for Lambdas #81

Closed
crtrott opened this issue Sep 6, 2015 · 10 comments
Assignees
Labels
Feature Request Create new capability; will potentially require voting
Milestone

Comments

@crtrott
Member

crtrott commented Sep 6, 2015

This is a topic we have discussed a couple of times, but I'd like to get feedback from users.
Currently, team shared memory is only supported with functors. The functor must provide a member function that returns how much shared memory it needs based on the team size.

One option to enable shared memory for lambdas would be to add options on the TeamPolicy:

TeamPolicy<> policy(N,M);
policy.set_shared_memory_size(size1);
policy.set_shared_memory_per_team_member(size2);

parallel_for(policy, ...);

This would request size1 + M*size2 bytes of shared memory.

Alternatively, one could also do:

policy.add_shared_memory_size(size1);
policy.add_shared_memory_per_team_member(size2);
policy.add_shared_memory_size(size3);
policy.add_shared_memory_per_team_member(size4);

This would request size1+size3+M*(size2+size4) bytes of shared memory.

Thoughts?

@crtrott crtrott added the Feature Request Create new capability; will potentially require voting label Sep 6, 2015
@mhoemmen
Contributor

mhoemmen commented Sep 7, 2015

What happens if both the policy and the functor have that method, to determine shared memory size? Which one does Kokkos pick?

@crtrott
Member Author

crtrott commented Sep 7, 2015

Interesting question. There are a couple of options:
(i) Throw if they don't agree.
(ii) Take the max of the two.
(iii) Give precedence to the functor, since it knows the work best.
(iv) Give precedence to the policy, since it represents a specific call site for a specific functor.

I know this is an issue, but I still think we need to support it for lambdas. Everybody wants lambdas, including me, and the more I use them, the less I want to write explicit functors.

@mhoemmen
Contributor

mhoemmen commented Sep 7, 2015

If the plan is to favor lambdas over functors, it would make sense to look at the Policy first. On the other hand, unexpected behavior is bad. I would say this: If both the functor and the Policy have the method, and if the functor asks for more shared memory than the Policy, report an error. Otherwise, the Policy determines the size. (The Policy controls the range, and league and team sizes, so it should control other aspects of the hardware as well.)

I think functors and lambdas should both treat shared memory allocations as actions that could fail. They should check the allocation result and reduce over the error code. This should catch any mismatch between the Policy's specification and the functor's / lambda's expectation.

@crtrott
Member Author

crtrott commented Oct 22, 2015

OK, I am implementing a variant right now, and it works. But I am not quite sure about the interface. This is related to another thing we are considering: a more general scratch memory mechanism.

What I implemented right now is:

TeamPolicy<>(league_size, team_size, [vector_length ,] TeamScratchSize(per_team_size, [per_thread_size=0]) );

Eventually I would like to specify scratch sizes for multiple memory spaces for the same functor.
We have now encountered this need in multiple big apps: they require scratch memory in size regimes ranging from kB to GB. This happens in Lulesh, Nalu, and the SM apps.
In those cases they currently allocate std::vectors inside the iterations (or at least resize them). To run in parallel, I need copies of those for every concurrently handled iteration. Some of those allocations are for the innermost levels (TeamThread loops), some are on the outer level. The sizes vary as well: some of the smaller ones could fit in real team shared memory, while others must stay in some larger space.

Let me know what you think of this start. Even if I push it, it will stay in the Experimental namespace for now.

@crtrott
Member Author

crtrott commented Oct 22, 2015

OK, I thought a bit more about it. Maybe it's enough to use that interface but restrict what you can give: effectively, a TeamPolicy would accept a templated TeamScratchSize, but the MemorySpace you can specify is limited to TeamPolicy::execution_space::memory_space and TeamPolicy::execution_space::scratch_memory_space. Something like this:

TeamPolicy<>(league_size, team_size,
  TeamScratchSize(per_team_CS, per_thread_CS),
  TeamScratchSize<Cuda::scratch_memory_space>(per_team_CSS, per_thread_CSS))

@mhoemmen
Contributor

This is relevant for Tpetra too. Consider a parallel loop that iterates over local rows of a Tpetra::RowMatrix. RowMatrix is abstract; it exposes a row's entries by copying them into user-provided space. The number of entries per row can vary a lot in theory, though it might not vary too much in practice. Of course this isn't really the right interface for fine-grained parallelism, but it would make sense for this to work in the common case. Users' loop bodies should check whether the scratch allocation suffices, and fail out for a later retry if it doesn't.

The std::vector reuse code is doing something analogous to a loop over rows of a Tpetra::RowMatrix, so fixing one case should fix the other.

crtrott added a commit that referenced this issue Oct 31, 2015
This adds shared memory support for lambdas according to issue #81
to Cuda, Pthreads and Serial. It also adds a unit test.

This does not add the generic scratch space discussed in issue #81.
@crtrott
Member Author

crtrott commented Oct 31, 2015

This is now available as an experimental feature.

hcedwar pushed a commit to hcedwar/kokkos that referenced this issue Nov 11, 2015
This adds shared memory support for lambdas according to issue kokkos#81
to Cuda, Pthreads and Serial. It also adds a unit test.

This does not add the generic scratch space discussed in issue kokkos#81.
hcedwar pushed a commit to hcedwar/kokkos that referenced this issue Nov 12, 2015
This adds shared memory support for lambdas according to issue kokkos#81
to Cuda, Pthreads and Serial. It also adds a unit test.

This does not add the generic scratch space discussed in issue kokkos#81.
@crtrott crtrott added this to the GTC 2016 milestone Nov 23, 2015
@crtrott
Member Author

crtrott commented Nov 23, 2015

After some discussion, we actually want an interface for scratch levels. More details later.

@crtrott crtrott self-assigned this Jan 14, 2016
@crtrott
Member Author

crtrott commented Jan 14, 2016

In light of the interface decisions we made for chunk size, here is my new proposed interface:

TeamPolicy<>(n,m).set_scratch_size(Level,PerTeam(size),PerThread(size))

Either PerThread or PerTeam can be omitted (you can give just one of the two).

@hcedwar
Contributor

hcedwar commented Mar 30, 2016

Sufficient for the current API. This will be revisited in the Summer comprehensive technical review.
