
Add data-duplicated (thread/team private) Kokkos View #825

Closed
stanmoore1 opened this issue May 19, 2017 · 38 comments
Assignees: ibaned
Labels: Feature Request (Create new capability; will potentially require voting)
Milestone: Backlog

Comments

stanmoore1 (Contributor) commented May 19, 2017

Right now atomic Views are needed in LAMMPS when summing quantities like forces on a central particle and neighboring particles because the central particle in one thread can be the neighboring particle in another thread, leading to write conflicts. I'm seeing horrible performance of atomic Views on KNL, while performance of atomic Views on GPUs is OK.

Native (non-Kokkos) OpenMP code in LAMMPS avoids atomics by instead using duplicated (thread-private) force arrays for each thread and then reducing over the duplicated arrays at the end of the parallel loop. The LAMMPS and SPARTA codes would benefit from a Kokkos View that has thread-private data for OpenMP and atomics for CUDA. CUDA could even have duplicated data for each warp/team to reduce atomic write conflicts.
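The duplicate-then-reduce pattern described here can be sketched in plain C++ (std::thread standing in for the OpenMP threads; the names are illustrative, not LAMMPS code): each thread sums into its own private copy of the array, and the copies are reduced after the parallel region, so no atomics are needed.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: scatter-add contributions into a length-n result using one
// private copy of the result per thread, then reduce over the copies.
std::vector<double> scatter_add_duplicated(
    const std::vector<int>& targets,    // index each contribution lands on
    const std::vector<double>& values,  // contribution values
    std::size_t n,                      // length of the result array
    unsigned nthreads) {
  std::vector<std::vector<double>> copies(nthreads,
                                          std::vector<double>(n, 0.0));
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < nthreads; ++t) {
    pool.emplace_back([&, t] {
      // Each thread processes a strided chunk and writes only to copies[t],
      // so there are no write conflicts inside the parallel region.
      for (std::size_t i = t; i < targets.size(); i += nthreads)
        copies[t][targets[i]] += values[i];
    });
  }
  for (auto& th : pool) th.join();
  // Post-kernel reduction over the duplicates.
  std::vector<double> result(n, 0.0);
  for (unsigned t = 0; t < nthreads; ++t)
    for (std::size_t j = 0; j < n; ++j) result[j] += copies[t][j];
  return result;
}
```

The memory cost is one extra copy of the array per thread, which is why the MPI-heavy, few-threads-per-rank configuration mentioned below matters.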

stanmoore1 (Contributor, Author) commented May 19, 2017

A little more context: on KNL we typically use 64 MPI x 4 OpenMP threads per node with LAMMPS, so the memory overhead of the data duplication isn't too large.

@ibaned ibaned added the Feature Request Create new capability; will potentially require voting label Jun 6, 2017
@hcedwar hcedwar added this to the Backlog milestone Jun 14, 2017
hcedwar (Contributor) commented Jun 14, 2017

@stanmoore1
Performance of atomics on KNL is known to be awful. We need to support data duplication and post-kernel summation as you request. TBD as to how best to do this.

ibaned (Contributor) commented Jun 14, 2017

  1. This is really only feasible on CPU platforms where a small number of threads is used. Just trying to define the scope, not trying to get out of doing it.

  2. @stanmoore1 Would it be okay if this was not Kokkos::View but rather a new type in the Kokkos containers library? Maybe call it Kokkos::ReduceView. It would be implemented as an atomic View on CUDA and would use data duplication otherwise. Then it could have semantics similar to DualView, so you could tell it when to make sure everything is reduced after your kernel (sort of like sync).

stanmoore1 (Contributor, Author):

@ibaned 1. Actually it could be useful for CUDA to reduce atomic contention. For example, if you had 4 copies of the data, you could divide all of the threads into 4 groups. You would still need atomics between threads in the same group, but each group would only write to its own copy of the data, reducing atomic contention by a factor of 4.
2. Yes, this should live in the containers library, like DualView.
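A minimal sketch of that grouping scheme in plain C++ (illustrative, not Kokkos code): thread t updates copy t % ncopies with atomics, so only threads sharing a copy can contend, and a final serial pass reduces the copies. Here the contribution is a simple count to keep the sketch short.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Count hits per target index with ncopies duplicates of the result.
// Threads mapped to the same copy still need atomics between themselves,
// but contention is reduced by roughly a factor of ncopies.
std::vector<long> grouped_scatter_add(const std::vector<int>& targets,
                                      std::size_t n, unsigned nthreads,
                                      unsigned ncopies) {
  // Flat storage: copy c occupies elements [c*n, (c+1)*n). Value-initialized
  // to zero by the vector constructor.
  std::vector<std::atomic<long>> copies(ncopies * n);
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < nthreads; ++t) {
    pool.emplace_back([&, t] {
      std::atomic<long>* my_copy = copies.data() + (t % ncopies) * n;
      for (std::size_t i = t; i < targets.size(); i += nthreads)
        my_copy[targets[i]].fetch_add(1, std::memory_order_relaxed);
    });
  }
  for (auto& th : pool) th.join();
  // Serial reduction over the duplicates (no atomics needed here).
  std::vector<long> result(n, 0);
  for (unsigned c = 0; c < ncopies; ++c)
    for (std::size_t j = 0; j < n; ++j) result[j] += copies[c * n + j].load();
  return result;
}
```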

stanmoore1 (Contributor, Author) commented Jun 14, 2017

@ibaned that said, I'd be totally fine with just a plain atomic View for CUDA, since atomics on the GPU are getting better and better.

ibaned (Contributor) commented Jun 14, 2017

Oh I see, you can do a two-level system with grouping...

So this thing basically has a knob we can call number_of_copies:

If number_of_copies > 1, then number_of_copies - 1 new Views are allocated to reduce contention. Calling the reduce() method after the kernel will trigger a parallel_for non-atomic reduction amongst the copies. Calling result_view() will get you a non-atomic View with the final numbers.

If number_of_copies < ExecutionSpace::concurrency(), then all copy Views are still atomic in terms of access semantics. I need to check with other Kokkos devs what the backend-agnostic system is for getting one's thread ID, if any.

Actually, users can specify the max number of copies they'll tolerate in terms of memory usage, and then if that's bigger than the number of threads, we'll be nice to them and only allocate as many as needed to cover all threads.

Seems roughly feasible, but it's definitely a full-sized feature request.
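The sizing rule in the last paragraph might look something like this (a hypothetical helper, not a Kokkos API): the user caps memory via a max copy count, we never allocate more duplicates than there are threads, and accesses need atomics only when there are fewer copies than threads.

```cpp
// Hypothetical planning helper for the number_of_copies knob described above.
struct ReplicationPlan {
  unsigned num_copies;   // duplicates actually allocated
  bool needs_atomics;    // true if copies must still be accessed atomically
};

ReplicationPlan plan_replication(unsigned max_copies, unsigned concurrency) {
  // Never allocate more copies than there are threads to use them.
  unsigned copies = max_copies < concurrency ? max_copies : concurrency;
  if (copies == 0) copies = 1;  // always at least the original data
  // With one copy per thread, each thread has exclusive access: no atomics.
  return {copies, copies < concurrency};
}
```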

stanmoore1 (Contributor, Author):

Exactly...

@ibaned ibaned self-assigned this Jun 14, 2017
stanmoore1 (Contributor, Author):

The two-level system was actually @crtrott's idea.

ibaned (Contributor) commented Jun 14, 2017

ah okay :) maybe I can get him to implement it, but not likely...

nmhamster (Contributor):

@stanmoore1 @ibaned @hcedwar - atomics are known to perform poorly on multiple CPU platforms, including KNL, Haswell, and POWER8. There is a little improvement on some variants of ARM. One reason is that atomics can seriously frustrate loop optimization, vectorization, etc., on top of the heavy overhead of actually issuing atomic operations. There is some hope this gets better in the future, but it's unlikely these issues will be completely dealt with. Having a non-atomic-based solution would be a better path forward for these platforms.

mhoemmen (Contributor):

@nmhamster it would be really nice if memory systems implemented atomic vector scatter...

nmhamster (Contributor):

@mhoemmen - we are pushing on vendors heavily for atomic vector scatter. We can't go into detail here (GitHub), but some progress is being made in this space.

nmhamster (Contributor):

@dsunder @hcedwar @stanmoore1 @ibaned @crtrott - I have looked into the possibility of using PREFETCHW instructions on KNL to request cache-line ownership resolution ahead of the atomic update being performed. I think it may also be possible to use these on Haswell, Broadwell, etc. moving forward. Benchmark results from before and after the change are below (around a 17.2% improvement for a simple case). This change may hurt heavily contended atomics by making them slower, since the cache lines are likely to bounce between request-for-ownership prefetches. I do not think we should optimize around the contended-atomics case, since that is likely to be slow regardless.

Prior to PREFETCHW being used

Without Atomics:                   1.79525 seconds
With Atomics:                      3.5666 seconds

After PREFETCHW is used in atomics operations

Without Atomics:                   1.79525 seconds
With Atomics:                      2.95290 seconds

I can issue a PR for these changes if you think we should go ahead and include them.
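For illustration, the prefetch-before-atomic idea might look like the following. `__builtin_prefetch` is a GCC/Clang builtin whose write hint (second argument = 1) compilers may lower to PREFETCHW on x86 targets that support it; the lookahead distance here is an assumed tuning knob, not a measured value.

```cpp
#include <atomic>
#include <cstddef>

// Sketch: request write ownership of the cache line a few iterations ahead
// of the atomic update, hiding ownership-resolution latency when atomics
// are lightly contended.
void atomic_add_with_prefetch(std::atomic<double>* dst, const int* idx,
                              const double* val, std::size_t count) {
  const std::size_t lookahead = 8;  // illustrative tuning knob
  for (std::size_t i = 0; i < count; ++i) {
    if (i + lookahead < count)
      __builtin_prefetch(&dst[idx[i + lookahead]], /*rw=*/1);
    // fetch_add on atomic<double> is C++20, so emulate it with a CAS loop.
    double expected = dst[idx[i]].load(std::memory_order_relaxed);
    while (!dst[idx[i]].compare_exchange_weak(expected, expected + val[i],
                                              std::memory_order_relaxed))
      ;  // expected is refreshed on failure; retry until the add lands
  }
}
```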

crtrott (Member) commented Jun 15, 2017

Oh, that is pretty cool, Si, and I think it is very much worth getting this in.

But I agree that we should do the other things as well:

  1. Support for turning off all atomics if Serial is the only backend.
  2. Support for the data replication and staggered data replication strategy. But I think that is something we need to do in containers.

ibaned (Contributor) commented Oct 11, 2017

So, the VPIC team would also benefit from this feature.

hcedwar (Contributor) commented Oct 17, 2017

Is the needed feature a data structure, or the scatter-add parallel pattern of #1171?
Separation of concerns is stronger with the parallel pattern and a behind-the-scenes (implementation) data structure.

hcedwar (Contributor) commented Oct 17, 2017

From @ibaned's email:

Kokkos::ReductionView<DataType> a(“name”, dims…); // allocates a certain number of duplicates
 
parallel_for({ a(i) += x; }); // if #threads > #duplicates, does atomics into the duplicates; otherwise plain writes
a.reduce(); // adds up the duplicates
parallel_for({ y = a(i); }); // get the fully reduced value

We can iterate on this, but I like the idea of it being a data structure rather than a feature built into parallel constructs.

hcedwar (Contributor) commented Oct 17, 2017

From @dsunder's email:
I would modify it slightly to allow for different reductions and to emphasize that the individual contributions are separate from the result.

Parameters in [..] are optional (like our current view).

ReductionView< Type
             , [Layout = DefaultExecutionSpace::array_layout]
             , [Memory/Device = DefaultExecutionSpace]
             , [ReductionOp = Sum] >
  r( view_alloc( “label”, [ num_duplicates = num_duplicates<Device>() ] )
    , array_dims…
    );
 
parallel_for(…, { r.contribute(i) += x; });
r.reduce();
parallel_for(…, { y(i) = r(i); });

ibaned (Contributor) commented Oct 17, 2017

Logging conversation with @stanmoore1:
If implemented as a data structure (a special kind of View), assignment to and from "normal" Views is a good interface. It should support the same arbitrary DataType as a normal View, e.g. double*[3], and hide atomics behind the normal += operator like the Atomic-trait View does. It needs to actually not use any atomics, given a sufficient degree of duplication. It can keep a smart-pointer view to the "original" data and only allocate n-1 duplicates.
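The "hide atomics behind the normal += operator" idea can be sketched with a proxy type. This is a hypothetical illustration in plain C++, not the eventual Kokkos interface: the access operator would return a small proxy whose operator+= is either a plain add (enough duplicates, no contention) or an atomic add (fewer duplicates than threads), and the caller writes `view(i) += x;` either way.

```cpp
#include <atomic>

// Hypothetical proxy: the compile-time flag selects plain stores when the
// duplication degree makes atomics unnecessary.
template <bool UseAtomics>
struct AddProxy;

template <>
struct AddProxy<false> {  // one copy per thread: plain add, no atomics
  long* p;
  void operator+=(long v) { *p += v; }
};

template <>
struct AddProxy<true> {   // fewer copies than threads: atomic add
  std::atomic<long>* p;
  void operator+=(long v) { p->fetch_add(v, std::memory_order_relaxed); }
};
```

A real implementation would pick the flag from the execution space (as in the OpenMP/Cuda/Serial defaults discussed later in this thread) rather than asking the user.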

hcedwar (Contributor) commented Oct 17, 2017

My initial concerns:

  • The lifetime of the ReductionView is greater than the parallel contribution.
  • Exposes implementation and thus restricts implementation.

@hcedwar
Copy link
Contributor

hcedwar commented Oct 17, 2017

The data structure and overload options are definitely necessary. But are these public or behind-the-scenes implementation?

ibaned (Contributor) commented Oct 17, 2017

One way to think of this is as a "drop-in" replacement for an Atomic View that performs better on CPUs by choosing, behind the scenes, between duplication and atomics. However, having the object itself be a public interface makes sense. For example, there are kernels in LAMMPS that use several of these views, which makes the parallel-pattern implementation in #1171 much more difficult. Lifetime is a concern: users do have to explicitly ensure the lifetime of these views is tight around the kernel, but that seems manageable.

ibaned (Contributor) commented Oct 17, 2017

The access operator() should not be both readable and writable. My preference is for write-only.

hcedwar (Contributor) commented Oct 18, 2017

Design cycle to reasonable convergence before progressing with an initial implementation.

ibaned (Contributor) commented Oct 18, 2017

Expanding on that: this was discussed at the Kokkos developers meeting. We agree it should be implemented, but several points of contention exist with the design.

  1. Keeping a copy of the original View and only allocating n-1 copies will save a bit of memory, but will make each access slower (there will be an if statement instead of just index math, and this could hinder compiler optimization). This was mainly brought up by @crtrott , so he and @stanmoore1 should discuss which side of this tradeoff matters most: memory or runtime.

  2. The hybrid case of some duplication and some atomics, while potentially interesting, can be omitted from the scope of the first implementation.

  3. The metaprogramming needed to add an extra dimension is non-trivial. So far we are thinking of restricting the Layout of the internal view to be only LayoutLeft or LayoutRight.

stanmoore1 (Contributor, Author):

@ibaned thanks for the update. Any timeline for implementation? I really need this for my codes, so unless implementation by Kokkos developers is imminent, I'm planning to make a prototype myself.

ibaned (Contributor) commented Oct 26, 2017

@stanmoore1 have you had a chance to chat with @crtrott on the points of contention? I can move forward with either design, just need to know which one...

ibaned (Contributor) commented Oct 26, 2017

That said, it will likely be at least a week, so that prototype may not be a bad idea.

hcedwar (Contributor) commented Nov 1, 2017

This is BLOCKED until an agreement is reached for the design.
@stanmoore1 @crtrott @hcedwar @ibaned

stanmoore1 (Contributor, Author):

@hcedwar are we going to set up a meeting to discuss this? With respect, I'm not a core Kokkos developer, and I really need this for my own codes, so making my own prototype (that I include in my own version of Kokkos) is still a possibility... if we can't agree on something soon.

stanmoore1 (Contributor, Author) commented Nov 1, 2017

Addressing the feedback by @ibaned:

> Expanding on that: this was discussed at the Kokkos developers meeting. We agree it should be implemented, but several points of contention exist with the design.

> Keeping a copy of the original View and only allocating n-1 copies will save a bit of memory, but will make each access slower (there will be an if statement instead of just index math, and this could hinder compiler optimization). This was mainly brought up by @crtrott , so he and @stanmoore1 should discuss which side of this tradeoff matters most: memory or runtime.

I think using n copies instead of n-1 would make the implementation a lot simpler. That does mean the reduction will also be over n copies instead of n-1, which could also reduce runtime performance, but I'm not sure whether that slowdown would be more significant than an if statement in the view access.

> The hybrid case of some duplication and some atomics, while potentially interesting, can be omitted from the scope of the first implementation.

Agreed.

> The metaprogramming needed to add an extra dimension is non-trivial. So far we are thinking of restricting the Layout of the internal view to be only LayoutLeft or LayoutRight.

I'm totally fine with that.

ibaned (Contributor) commented Nov 1, 2017

Just met with @stanmoore1 , below is a summary of some of the things we discussed.

The class name will likely be ReductionView, and will have certain aspects that are controlled by template parameters, beyond the usual parameters to View:

  1. Whether contributions are atomically applied to memory (Atomic, NonAtomic (name up for debate))
  2. Whether (and possibly how) memory is duplicated/enlarged to reduce contention (Duplicated, NonDuplicated (name up for debate))
  3. What the reduction operator is (Sum by default)

These are closely related, and the default cases will be:

  1. Kokkos::OpenMP: Duplicated and NonAtomic
  2. Kokkos::Cuda: NonDuplicated and Atomic
  3. Kokkos::Serial: NonDuplicated and NonAtomic

The ideal interface looks something like this:

// This call does the memory allocation (if any)
// We need to think about names, Kokkos::Sum is already a thing (can it be reused?)
// If not specified, defaults to Sum
// If not specified, the Duplicated versus NonDuplicated is chosen based on ExecSpace
auto reduction_view = Kokkos::create_reduction_view<Sum, Duplicated>(original_view);
// This deep copies into the first "slice" if we use Christian's design
Kokkos::deep_copy(reduction_view, original_view);
Kokkos::parallel_for(niters, KOKKOS_LAMBDA(int i) {
  auto j = some_function(i);
  auto contribution = some_math();
  // this either adds to a duplicated memory location or atomic adds to a non-duplicated one, etc.
  // only operator+= is available, no operator= or read access
  reduction_view(j) += contribution;
});
// So far I'm thinking this deep_copy will also do the reduction across duplicated memory
Kokkos::deep_copy(original_view, reduction_view);

How memory is duplicated remains a point of discussion, I think the main pros and cons are as follows:

  1. A whole new View which has a new dimension of size nthreads. The pro is that the access operator (operator()) will be faster because it's just index arithmetic. The cons are that a real memcpy is required between original_view and reduction_view, and more memory will be used.
  2. Two views, one of which is a shallow pointer to original_view, and the second has a new dimension with size nthreads-1. The pros are that no memcpy is required either before the kernel or after the reduction (the reduction algorithm can store straight into this shallow copy). The cons are that the access operator (operator()) has to choose between the two views with an if statement.

The two-view approach is slower inside the kernel, while the single-view approach is slower due to deep copies between it and the original view. It is currently unclear, and likely kernel-dependent, which of these slowdowns is worse.

I am inclined to begin development with the single-view, nthreads-duplicated, index-only design and start gathering performance data.
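To make the tradeoff concrete, here is a sketch of the two access schemes in plain C++ for a 1-D view of length n with LayoutRight-style indexing (illustrative indexing only, not Kokkos internals):

```cpp
#include <cstddef>

// Scheme 1: one allocation with an extra leading dimension of size nthreads;
// access is pure index arithmetic, no branch.
inline double& access_single(double* dup, std::size_t n, unsigned thread,
                             std::size_t i) {
  return dup[static_cast<std::size_t>(thread) * n + i];
}

// Scheme 2: thread 0 writes straight into the original data; threads
// 1..nthreads-1 use a separate (nthreads-1)-sized duplicate allocation.
// This saves one copy of memory and the memcpy, but the branch sits on
// every access and can hinder compiler optimization.
inline double& access_two_view(double* original, double* dup, std::size_t n,
                               unsigned thread, std::size_t i) {
  return thread == 0
             ? original[i]
             : dup[(static_cast<std::size_t>(thread) - 1) * n + i];
}
```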

stanmoore1 (Contributor, Author):

A third memory-duplication approach would be to have permanently duplicated memory with a new dimension of size nthreads, and take a subview slice to get the original view. That approach has contiguous memory, doesn't require the memcpy, uses less memory than option 1, and doesn't need an if statement in the access operator like option 2. This third option would likely be the fastest at runtime, and a similar method is used in LAMMPS in the native OpenMP code, except that instead of adding a new dimension, one of the existing dimensions is extended by a factor of nthreads. The con is that the duplicated memory is long-lived and cannot be deallocated around the kernel. Integration into an application code could also be more intrusive.
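The third option can be sketched like this (plain C++, names hypothetical): the allocation is permanently nthreads times larger, slice 0 doubles as the "original" view, and a reduce step folds the other slices into it, so no copy-in or access branch is needed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical permanently-duplicated storage: slice t owns elements
// [t*n, (t+1)*n), and slice 0 serves as the original view.
struct ExtendedDup {
  std::vector<double> data;  // permanently sized nthreads * n
  std::size_t n;
  unsigned nthreads;
  ExtendedDup(std::size_t n_, unsigned nt)
      : data(n_ * nt, 0.0), n(n_), nthreads(nt) {}
  double* slice(unsigned t) {
    return data.data() + static_cast<std::size_t>(t) * n;
  }
  // Reduce all slices into slice 0 and return it; no memcpy, no branch
  // in the per-element access, but the extra memory is long-lived.
  double* reduce() {
    for (unsigned t = 1; t < nthreads; ++t)
      for (std::size_t j = 0; j < n; ++j) data[j] += slice(t)[j];
    return slice(0);
  }
};
```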

ibaned (Contributor) commented Nov 2, 2017

> A third memory duplication approach would be to have permanently duplicated memory that has a new dimension of size nthreads and take a subview slice to get the original view.

I see that more as an alternative use case of the nthreads-replicated design. In theory, I just have to provide a subview method of some sort to get the subview, and then it can be used that way.

stanmoore1 (Contributor, Author):

@ibaned that's true.

ibaned (Contributor) commented Nov 2, 2017

I've started drafting an implementation here:
https://github.com/kokkos/kokkos/blob/issue-825/containers/src/Kokkos_ReductionView.hpp

@ibaned ibaned mentioned this issue Nov 10, 2017
ibaned (Contributor) commented Nov 10, 2017

The draft is complete and seems to be performing well.
Pull request #1225 is open with a full implementation.
