Add data-duplicated (thread/team private) Kokkos View #825
A little more context: on KNL we typically use 64 MPI ranks x 4 OpenMP threads per node with LAMMPS, so the memory overhead of the data duplication isn't too large.
@ibaned 1. Actually it could be useful for CUDA to reduce atomic contention. For example, if you had 4 copies of the data then you could divide all of the threads into groups of 4. You would still need atomics between threads in the same group, but each group would only write to its group copy of the data, reducing atomic contention by a factor of 4.
@ibaned that said, I'd be totally fine with just a plain atomic.
oh I see, you can do a two-level system with grouping... So this thing basically has a knob controlling the number of copies. Actually, users can specify the max number of copies they'll tolerate in terms of memory usage, and then if that's bigger than the number of threads we'll be nice to them and only allocate as many as needed to cover all threads. Seems roughly feasible, but it's definitely a full-sized feature request.
Exactly...
The two-level system was actually @crtrott's idea
ah okay :) maybe I can get him to implement it, but not likely...
@stanmoore1 @ibaned @hcedwar - atomics are known to perform poorly on multiple CPU platforms including KNL, Haswell, and POWER8. There is a little improvement on some variants of ARM. One of the reasons is that they can seriously frustrate optimization of loops, vectorization, etc., in addition to the heavy overhead associated with actually issuing atomic operations. There is some hope this gets better in the future, but it's unlikely these issues will be completely dealt with. Having a non-atomic-based solution would be a better path forward for these platforms.
@nmhamster would be really nice if memory systems implemented atomic vector scatter...
@mhoemmen - we are pushing on vendors heavily for atomic vector scatter. We can't go into detail here on GitHub, but some progress is being made in this space.
@dsunder @hcedwar @stanmoore1 @ibaned @crtrott - I have looked into the possibility of using Prior to
After
I can issue a PR for these changes if you think we should go ahead and include them.
Oh that is pretty cool, Si, and I think it is very much worth getting this in. But I agree that we should do the other things as well:
So, the VPIC team would also benefit from this feature.
Is the needed feature a data structure, or the scatter-add parallel pattern (#1171)?
From @ibaned's email:
We can iterate on this, but I like the idea of it being a data structure rather than a feature built into parallel constructs.
From @dsunder's email: Parameters in [..] are optional (like our current view).
Logging conversation with @stanmoore1:
My initial concerns:
Example of a kernel that needs this:
Current custom implementation of data duplication in LAMMPS (thread-private data reduction in the pure-OpenMP backend): https://github.com/lammps/lammps/blob/master/src/USER-OMP/thr_data.cpp#L296
The data structure and overload options are definitely necessary. But are these part of the public interface, or behind-the-scenes implementation details?
One way to think of this is as a "drop-in" replacement for an atomic View that performs better on CPUs by choosing, behind the scenes, between duplication and atomics. However, having the object itself be a public interface makes sense. For example, there are kernels in LAMMPS that use several of these views, which would make the parallel pattern implementation in #1171 much more difficult. Lifetime is a concern: users do have to explicitly ensure the lifetime of these views is tight around the kernel, but that seems manageable.
The access operator() should not be both readable and writable. My preference is for write-only.
Design cycle to reasonable convergence before progressing with an initial implementation. |
Expanding on that: this was discussed at the Kokkos developers meeting. We agree it should be implemented, but several points of contention exist with the design.
@ibaned thanks for the update. Any timeline for implementation? I really need this for my codes, so unless implementation by the Kokkos developers is imminent, I'm planning to make a prototype myself.
@stanmoore1 have you had a chance to chat with @crtrott on the points of contention? I can move forward with either design, I just need to know which one...
That said, it will likely be at least a week, so that prototype may not be a bad idea.
This is BLOCKED until an agreement is reached for the design. |
@hcedwar are we going to set up a meeting to discuss this? With respect, I'm not a core Kokkos developer, and I really need this for my own codes, so making my own prototype (that I include in my own version of Kokkos) is still a possibility... if we can't agree on something soon.
Addressing the feedback by @ibaned:
I think using n copies instead of n-1 would make the implementation a lot simpler. That does mean the reduction will also be over n copies instead of n-1, which could reduce runtime performance, but I'm not sure whether that slowdown would be more significant than an if statement in the view access.
Agreed.
I'm totally fine with that.
Just met with @stanmoore1; below is a summary of some of the things we discussed. The class name will likely be
These are closely related, and the default cases will be:
The ideal interface looks something like this:

```cpp
// This call does the memory allocation (if any).
// We need to think about names; Kokkos::Sum is already a thing (can it be reused?).
// If not specified, the operation defaults to Sum.
// If not specified, Duplicated versus NonDuplicated is chosen based on the ExecSpace.
auto reduction_view = Kokkos::create_reduction_view<Sum, Duplicated>(original_view);

// This deep copies into the first "slice" if we use Christian's design.
Kokkos::deep_copy(reduction_view, original_view);

Kokkos::parallel_for(niters, KOKKOS_LAMBDA(int i) {
  auto j = some_function(i);
  auto contribution = some_math();
  // This either adds to a duplicated memory location or atomically adds to a
  // non-duplicated one, etc.
  // Only operator+= is available; no operator= or read access.
  reduction_view(j) += contribution;
});

// So far I'm thinking this deep_copy will also do the reduction across duplicated memory.
Kokkos::deep_copy(original_view, reduction_view);
```

How memory is duplicated remains a point of discussion; I think the main pros and cons are as follows:
The two-view approach is slower inside the kernel, while the single-view approach is slower due to deep copies between it and the original view. It is currently unclear, and likely kernel-dependent, which of these slowdowns is worse. I am inclined to begin development with the single-view approach.
A third memory-duplication approach would be to have permanently duplicated memory that has a new dimension of size equal to the number of copies.
I see that more as an alternative use case of the
@ibaned that's true.
I've started drafting an implementation here: |
The draft is complete and seems to be performing well. |
Right now atomic `View`s are needed in LAMMPS when summing quantities like forces on a central particle and its neighboring particles, because the central particle in one thread can be a neighboring particle in another thread, leading to write conflicts. I'm seeing horrible performance of atomic `View`s on KNL, while performance of atomic `View`s on GPUs is OK.

Native (non-Kokkos) OpenMP code in LAMMPS avoids atomics by instead using duplicated (thread-private) force arrays for each thread and then reducing over the duplicated arrays at the end of the parallel loop. The LAMMPS and SPARTA codes would benefit from a Kokkos `View` that has thread-private data for OpenMP and atomics for CUDA. CUDA could even have duplicated data for each warp/team to reduce atomic write conflicts.