SharedMemory Support for Lambdas #81
What happens if both the policy and the functor have that method to determine shared memory size? Which one does Kokkos pick?
Interesting question. There are a couple of options. I know this is an issue, but I still think we need to support it for lambdas. Everybody wants lambdas, including me, and the more I use them, the less I want to write explicit functors...
If the plan is to favor lambdas over functors, it would make sense to look at the Policy first. On the other hand, unexpected behavior is bad. I would say this: If both the functor and the Policy have the method, and if the functor asks for more shared memory than the Policy, report an error. Otherwise, the Policy determines the size. (The Policy controls the range, and league and team sizes, so it should control other aspects of the hardware as well.) I think functors and lambdas should both treat shared memory allocations as actions that could fail. They should check the allocation result and reduce over the error code. This should catch any mismatch between the Policy's specification and the functor's / lambda's expectation.
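The check-and-reduce idea above can be sketched without any Kokkos machinery. The following is a hypothetical model (none of these names are Kokkos API): each team member tries to carve its slice out of a fixed per-team scratch pool, records an error code on failure, and the codes are max-reduced so any single failure is visible to the caller.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for a per-team scratch pool (not the Kokkos API).
struct ScratchPool {
  std::size_t capacity;
  std::size_t used = 0;
  // Returns 0 on success, 1 if the pool would be exhausted.
  int allocate(std::size_t bytes) {
    if (used + bytes > capacity) return 1;
    used += bytes;
    return 0;
  }
};

// Stand-in for a team loop: every member allocates, and the error codes
// are reduced (max), mirroring a parallel_reduce over the error code.
int run_team(std::size_t team_size, std::size_t per_thread_bytes,
             std::size_t pool_bytes) {
  ScratchPool pool{pool_bytes};
  int error = 0;
  for (std::size_t t = 0; t < team_size; ++t)
    error = std::max(error, pool.allocate(per_thread_bytes));
  return error;  // 0 only if every member's allocation succeeded
}
```

The caller can then retry with a larger request when the reduced error is nonzero, which is exactly the "fail out for later retry" pattern discussed below.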
Ok, I am implementing a variant right now and it works, but I am not quite sure about the interface. This is related to another thing we are considering: a more general scratch memory mechanism. What I implemented right now is: TeamPolicy<>(league_size, team_size, [vector_length,] TeamScratchSize(per_team_size, [per_thread_size=0])); Eventually I would like to specify scratch sizes for multiple memory spaces for the same functor. Let me know what you think of this start, even if I push it in the Experimental namespace for now.
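To make the proposed argument concrete, here is a minimal sketch of what a TeamScratchSize bundle would imply for the total request. This is a hypothetical stand-in, not the shipped Kokkos type; it just records the two sizes and computes per_team_size + team_size * per_thread_size.

```cpp
#include <cstddef>

// Hypothetical model of the proposed TeamScratchSize argument
// (the real Kokkos type may differ).
struct TeamScratchSize {
  std::size_t per_team;
  std::size_t per_thread;
  explicit TeamScratchSize(std::size_t team, std::size_t thread = 0)
      : per_team(team), per_thread(thread) {}
};

// Total scratch a team of team_size threads would request under this scheme.
std::size_t total_scratch(const TeamScratchSize& s, std::size_t team_size) {
  return s.per_team + team_size * s.per_thread;
}
```

For example, TeamScratchSize(1024, 16) with a team size of 32 would request 1024 + 32*16 = 1536 bytes.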
OK, I thought a bit more about it. Maybe it's enough to use that interface but restrict what you can pass. Effectively, a TeamPolicy would accept a templated TeamScratchSize, but the MemorySpace you can specify is limited to TeamPolicy::execution_space::memory_space and TeamPolicy::execution_space::scratch_memory_space. Something like this: TeamPolicy(league_size,team_size,
This is relevant for Tpetra too. Consider a parallel loop that iterates over local rows of a Tpetra::RowMatrix. RowMatrix is abstract; it exposes a row's entries by copying them into user-provided space. The number of entries per row can vary a lot in theory, though it might not vary too much in practice. Of course this isn't really the right interface for fine-grained parallelism, but it would make sense for this to work in the common case. Users' loop body should check if the scratch allocation suffices, and fail out for later retry if it doesn't. The std::vector reuse code is doing something analogous to a loop over rows of a Tpetra::RowMatrix, so fixing one case should fix the other.
This is now available as an experimental feature |
After some discussion, we actually want an interface for scratch levels. More details later.
In light of the interface decisions we made for chunk size, here is my new proposed interface: TeamPolicy<>(n,m).set_scratch_size(Level, PerTeam(size), PerThread(size)) Either PerThread or PerTeam is optional (you can give only one).
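A minimal sketch of how that proposal could behave, assuming two scratch levels and treating PerTeam/PerThread as simple tag wrappers. These class definitions are illustrative models that mirror the names in the proposal; they are not the actual Kokkos implementation.

```cpp
#include <array>
#include <cstddef>

// Hypothetical tag wrappers so the caller can pass either or both.
struct PerTeam   { std::size_t bytes; };
struct PerThread { std::size_t bytes; };

// Hypothetical model of TeamPolicy<>(n,m).set_scratch_size(...).
struct TeamPolicyModel {
  std::size_t league_size, team_size;
  std::array<std::size_t, 2> team_bytes{};    // per scratch level
  std::array<std::size_t, 2> thread_bytes{};  // per scratch level

  TeamPolicyModel(std::size_t n, std::size_t m)
      : league_size(n), team_size(m) {}

  // Both arguments, or PerTeam only (PerThread defaults to zero).
  TeamPolicyModel& set_scratch_size(int level, PerTeam t,
                                    PerThread p = PerThread{0}) {
    team_bytes[level] = t.bytes;
    thread_bytes[level] = p.bytes;
    return *this;
  }
  // PerThread only.
  TeamPolicyModel& set_scratch_size(int level, PerThread p) {
    return set_scratch_size(level, PerTeam{0}, p);
  }

  // Total request at a level: per-team bytes plus team_size * per-thread bytes.
  std::size_t scratch_size(int level) const {
    return team_bytes[level] + team_size * thread_bytes[level];
  }
};
```

For instance, TeamPolicyModel(100, 8).set_scratch_size(0, PerTeam{256}, PerThread{32}) would request 256 + 8*32 = 512 bytes at level 0, and the overloads allow passing only one of the two wrappers, matching the "either is optional" rule above.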
Sufficient for the current API. Will be revisited in the Summer comprehensive technical review.
This is a topic we discussed a couple of times, but I'd like to get feedback from users.
Currently Team shared memory is only supported with functors. The functor must have a member function which returns how much shared memory it needs based on the team size.
One option to enable shared memory for lambdas would be to add options on the TeamPolicy:
TeamPolicy<> policy(N,M);
policy.set_shared_memory_size(size1);
policy.set_shared_memory_per_team_member(size2);
parallel_for(policy, ...);
This would request size1+M*size2 bytes of shared memory.
Alternatively one could also do:
policy.add_shared_memory_size(size1);
policy.add_shared_memory_per_team_member(size2);
policy.add_shared_memory_size(size3);
policy.add_shared_memory_per_team_member(size4);
This would request size1+size3+M*(size2+size4) bytes of shared memory.
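The additive variant can be modeled with a small accumulator, shown here as a hypothetical sketch (these names echo the proposal above and are not an existing Kokkos API): repeated calls sum into two counters, and the total request is size1+size3 + M*(size2+size4).

```cpp
#include <cstddef>

// Hypothetical accumulator modeling the additive proposal.
struct SharedMemRequest {
  std::size_t per_launch = 0;  // flat bytes, shared by the whole team
  std::size_t per_member = 0;  // bytes multiplied by the team size M

  void add_shared_memory_size(std::size_t s) { per_launch += s; }
  void add_shared_memory_per_team_member(std::size_t s) { per_member += s; }

  // Total bytes requested for a team of team_size members.
  std::size_t total(std::size_t team_size) const {
    return per_launch + team_size * per_member;
  }
};
```

With size1=100, size2=8, size3=50, size4=4, and M=16, this yields (100+50) + 16*(8+4) = 342 bytes, matching the formula in the proposal.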
Thoughts?