
Add data-duplicated (thread/team private) Kokkos View #825

Closed
stanmoore1 opened this issue May 19, 2017 · 38 comments
Assignees: ibaned
Labels: Feature Request (Create new capability; will potentially require voting)
Milestone: Backlog

Comments

stanmoore1 (Contributor) commented May 19, 2017

Right now atomic Views are needed in LAMMPS when summing quantities like forces on a central particle and neighboring particles because the central particle in one thread can be the neighboring particle in another thread, leading to write conflicts. I'm seeing horrible performance of atomic Views on KNL, while performance of atomic Views on GPUs is OK.

Native (non-Kokkos) OpenMP code in LAMMPS avoids atomics by instead using duplicated (thread-private) force arrays for each thread and then reducing over the duplicated arrays at the end of the parallel loop. The LAMMPS and SPARTA codes would benefit from a Kokkos View that has thread-private data for OpenMP and atomics for CUDA. CUDA could even have duplicated data for each warp/team to reduce atomic write conflicts.
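The duplicate-then-reduce pattern described here can be sketched in plain C++ (std::thread standing in for the OpenMP threads; the names are illustrative, not LAMMPS code): each thread sums into its own private copy of the array, and the copies are reduced after the parallel region, so no atomics are needed.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: scatter-add contributions into a length-n result using one
// private copy of the result per thread, then reduce over the copies.
std::vector<double> scatter_add_duplicated(
    const std::vector<int>& targets,    // index each contribution lands on
    const std::vector<double>& values,  // contribution values
    std::size_t n,                      // length of the result array
    unsigned nthreads) {
  std::vector<std::vector<double>> copies(nthreads,
                                          std::vector<double>(n, 0.0));
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < nthreads; ++t) {
    pool.emplace_back([&, t] {
      // Each thread processes a strided chunk and writes only to copies[t],
      // so there are no write conflicts inside the parallel region.
      for (std::size_t i = t; i < targets.size(); i += nthreads)
        copies[t][targets[i]] += values[i];
    });
  }
  for (auto& th : pool) th.join();
  // Post-kernel reduction over the duplicates.
  std::vector<double> result(n, 0.0);
  for (unsigned t = 0; t < nthreads; ++t)
    for (std::size_t j = 0; j < n; ++j) result[j] += copies[t][j];
  return result;
}
```

The memory cost is one extra copy of the array per thread, which is why the MPI-heavy, few-threads-per-rank configuration mentioned below matters.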

stanmoore1 (Contributor, Author) commented May 19, 2017

A little more context: on KNL we typically use 64 MPI x 4 OpenMP threads per node with LAMMPS, so the memory overhead of the data duplication isn't too large.

@ibaned ibaned added the Feature Request Create new capability; will potentially require voting label Jun 6, 2017
@hcedwar hcedwar added this to the Backlog milestone Jun 14, 2017
hcedwar (Contributor) commented Jun 14, 2017

@stanmoore1
Performance of atomics on KNL is known to be awful. We need to support data duplication and post-kernel summation as you request. TBD as to how best to do this.

ibaned (Contributor) commented Jun 14, 2017

  1. This is really only feasible on CPU platforms where a small number of threads is used. Just trying to define the scope, not trying to get out of doing it.

  2. @stanmoore1 Would it be okay if this was not Kokkos::View but rather a new type in the Kokkos containers library? Maybe call it Kokkos::ReduceView. It would be implemented as an atomic View on CUDA and would use data duplication otherwise. Then it could have semantics similar to DualView, so you could tell it when to make sure everything is reduced after your kernel (sort of like sync).

stanmoore1 (Contributor, Author):

@ibaned 1. Actually it could be useful for CUDA to reduce atomic contention. For example, if you had 4 copies of the data, you could divide all of the threads into 4 groups. You would still need atomics between threads in the same group, but each group would only write to its own copy of the data, reducing atomic contention by a factor of 4.
2. Yes, this should live in the containers library, like DualView.
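A minimal sketch of that grouping scheme in plain C++ (illustrative, not Kokkos code): thread t updates copy t % ncopies with atomics, so only threads sharing a copy can contend, and a final serial pass reduces the copies. Here the contribution is a simple count to keep the sketch short.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Count hits per target index with ncopies duplicates of the result.
// Threads mapped to the same copy still need atomics between themselves,
// but contention is reduced by roughly a factor of ncopies.
std::vector<long> grouped_scatter_add(const std::vector<int>& targets,
                                      std::size_t n, unsigned nthreads,
                                      unsigned ncopies) {
  // Flat storage: copy c occupies elements [c*n, (c+1)*n). Value-initialized
  // to zero by the vector constructor.
  std::vector<std::atomic<long>> copies(ncopies * n);
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < nthreads; ++t) {
    pool.emplace_back([&, t] {
      std::atomic<long>* my_copy = copies.data() + (t % ncopies) * n;
      for (std::size_t i = t; i < targets.size(); i += nthreads)
        my_copy[targets[i]].fetch_add(1, std::memory_order_relaxed);
    });
  }
  for (auto& th : pool) th.join();
  // Serial reduction over the duplicates (no atomics needed here).
  std::vector<long> result(n, 0);
  for (unsigned c = 0; c < ncopies; ++c)
    for (std::size_t j = 0; j < n; ++j) result[j] += copies[c * n + j].load();
  return result;
}
```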

stanmoore1 (Contributor, Author) commented Jun 14, 2017

@ibaned that said, I'd be totally fine with just a plain atomic View for CUDA, since atomics on the GPU are getting better and better.

ibaned (Contributor) commented Jun 14, 2017

Oh I see, you can do a two-level system with grouping...

So this thing basically has a knob we can call number_of_copies:

If number_of_copies > 1, then number_of_copies - 1 new Views are allocated to reduce contention. Calling the reduce() method after the kernel will trigger a parallel_for non-atomic reduction amongst the copies. Calling result_view() will get you a non-atomic View with the final numbers.

If number_of_copies < ExecutionSpace::concurrency(), then all copy Views are still atomic in terms of access semantics. I need to check with other Kokkos devs what the backend-agnostic system is for getting one's thread ID, if any.

Actually, users can specify the max number of copies they'll tolerate in terms of memory usage, and then if that's bigger than the number of threads, we'll be nice to them and only allocate as many as needed to cover all threads.

Seems roughly feasible, but it's definitely a full-sized feature request.
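The sizing rule in the last paragraph might look something like this (a hypothetical helper, not a Kokkos API): the user caps memory via a max copy count, we never allocate more duplicates than there are threads, and accesses need atomics only when there are fewer copies than threads.

```cpp
// Hypothetical planning helper for the number_of_copies knob described above.
struct ReplicationPlan {
  unsigned num_copies;   // duplicates actually allocated
  bool needs_atomics;    // true if copies must still be accessed atomically
};

ReplicationPlan plan_replication(unsigned max_copies, unsigned concurrency) {
  // Never allocate more copies than there are threads to use them.
  unsigned copies = max_copies < concurrency ? max_copies : concurrency;
  if (copies == 0) copies = 1;  // always at least the original data
  // With one copy per thread, each thread has exclusive access: no atomics.
  return {copies, copies < concurrency};
}
```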

stanmoore1 (Contributor, Author):

Exactly...

@ibaned ibaned self-assigned this Jun 14, 2017
stanmoore1 (Contributor, Author):

The two-level system was actually @crtrott's idea.

ibaned (Contributor) commented Jun 14, 2017

ah okay :) maybe I can get him to implement it, but not likely...

nmhamster (Contributor):

@stanmoore1 @ibaned @hcedwar - atomics are known to perform poorly on multiple CPU platforms, including KNL, Haswell, and POWER8. There is a little improvement on some variants of ARM. One reason is that atomics can seriously frustrate loop optimization, vectorization, etc., on top of the heavy overhead of actually issuing atomic operations. There is some hope this gets better in the future, but it's unlikely these issues will be completely dealt with. Having a non-atomic-based solution would be a better path forward for these platforms.

mhoemmen (Contributor):

@nmhamster it would be really nice if memory systems implemented atomic vector scatter...

nmhamster (Contributor):

@mhoemmen - we are pushing on vendors heavily for atomic vector scatter. We can't go into detail here (GitHub), but some progress is being made in this space.

nmhamster (Contributor):

@dsunder @hcedwar @stanmoore1 @ibaned @crtrott - I have looked into the possibility of using PREFETCHW instructions on KNL to request cache-line ownership resolution ahead of the atomic update being performed. I think it may also be possible to use these on Haswell, Broadwell, etc. moving forward. Benchmark results from before and after the change are below (around a 17.2% improvement for a simple case). This change may hurt heavily contended atomics by making them slower, since the cache lines are likely to bounce between request-for-ownership prefetches. I do not think we should optimize around the contended-atomics case, since that is likely to be slow regardless.

Prior to PREFETCHW being used

Without Atomics:                   1.79525 seconds
With Atomics:                      3.5666 seconds

After PREFETCHW is used in atomics operations

Without Atomics:                   1.79525 seconds
With Atomics:                      2.95290 seconds

I can issue a PR for these changes if you think we should go ahead and include them.
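For illustration, the prefetch-before-atomic idea might look like the following. `__builtin_prefetch` is a GCC/Clang builtin whose write hint (second argument = 1) compilers may lower to PREFETCHW on x86 targets that support it; the lookahead distance here is an assumed tuning knob, not a measured value.

```cpp
#include <atomic>
#include <cstddef>

// Sketch: request write ownership of the cache line a few iterations ahead
// of the atomic update, hiding ownership-resolution latency when atomics
// are lightly contended.
void atomic_add_with_prefetch(std::atomic<double>* dst, const int* idx,
                              const double* val, std::size_t count) {
  const std::size_t lookahead = 8;  // illustrative tuning knob
  for (std::size_t i = 0; i < count; ++i) {
    if (i + lookahead < count)
      __builtin_prefetch(&dst[idx[i + lookahead]], /*rw=*/1);
    // fetch_add on atomic<double> is C++20, so emulate it with a CAS loop.
    double expected = dst[idx[i]].load(std::memory_order_relaxed);
    while (!dst[idx[i]].compare_exchange_weak(expected, expected + val[i],
                                              std::memory_order_relaxed))
      ;  // expected is refreshed on failure; retry until the add lands
  }
}
```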

crtrott (Member) commented Jun 15, 2017

Oh, that is pretty cool, Si, and I think it is very much worth getting this in.

But I agree that we should do the other things as well:

  1. Support for turning off all atomics if Serial is the only backend.
  2. Support for the data replication and staggered data replication strategy. But I think that is something we need to do in containers.

ibaned (Contributor) commented Oct 11, 2017

So, the VPIC team would also benefit from this feature.

hcedwar (Contributor) commented Oct 17, 2017

Is the needed feature a data structure, or the scatter-add parallel pattern of #1171?
Separation of concerns is stronger with the parallel pattern and a behind-the-scenes (implementation) data structure.

hcedwar (Contributor) commented Oct 17, 2017

From @ibaned's email:

Kokkos::ReductionView<DataType> a(“name”, dims…); // allocates a certain number of duplicates
 
parallel_for({ a(i) += x; }); // if #threads > #duplicates, does atomics into the duplicates; otherwise plain writes
a.reduce(); // adds up the duplicates
parallel_for({ y = a(i); }); // get the fully reduced value

We can iterate on this, but I like the idea of it being a data structure rather than a feature built into parallel constructs.

hcedwar (Contributor) commented Oct 17, 2017

From @dsunder's email:
I would modify it slightly to allow for different reductions and to emphasize that the individual contributions are separate from the result.

Parameters in [..] are optional (like our current view).

ReductionView< Type
             , [Layout = DefaultExecutionSpace::array_layout]
             , [Memory/Device = DefaultExecutionSpace]
             , [ReductionOp = Sum] >
  r( view_alloc( “label”, [ num_duplicates = num_duplicates<Device>() ] )
    , array_dims…
    );
 
parallel_for(…, { r.contribute(i) += x; });
r.reduce();
parallel_for(…, { y(i) = r(i); });

ibaned (Contributor) commented Oct 17, 2017

Logging conversation with @stanmoore1:
If implemented as a data structure (a special kind of View), assignment to and from "normal" Views is a good interface. It should support the same arbitrary DataType as a normal View, e.g. double*[3], and hide atomics behind the normal += operator like the Atomic-trait View does. It needs to actually not use any atomics, given a sufficient degree of duplication. It can keep a smart-pointer view to the "original" data and only allocate n-1 duplicates.
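The "hide atomics behind the normal += operator" idea can be sketched with a proxy type. This is a hypothetical illustration in plain C++, not the eventual Kokkos interface: the access operator would return a small proxy whose operator+= is either a plain add (enough duplicates, no contention) or an atomic add (fewer duplicates than threads), and the caller writes `view(i) += x;` either way.

```cpp
#include <atomic>

// Hypothetical proxy: the compile-time flag selects plain stores when the
// duplication degree makes atomics unnecessary.
template <bool UseAtomics>
struct AddProxy;

template <>
struct AddProxy<false> {  // one copy per thread: plain add, no atomics
  long* p;
  void operator+=(long v) { *p += v; }
};

template <>
struct AddProxy<true> {   // fewer copies than threads: atomic add
  std::atomic<long>* p;
  void operator+=(long v) { p->fetch_add(v, std::memory_order_relaxed); }
};
```

A real implementation would pick the flag from the execution space (as in the OpenMP/Cuda/Serial defaults discussed later in this thread) rather than asking the user.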

hcedwar (Contributor) commented Oct 17, 2017

My initial concerns:

  • The lifetime of the ReductionView is greater than the parallel contribution.
  • Exposes implementation and thus restricts implementation.

@hcedwar
Copy link
Contributor

hcedwar commented Oct 17, 2017

The data structure and overload options are definitely necessary. But are these public or behind-the-scenes implementation?

ibaned (Contributor) commented Oct 17, 2017

One way to think of this is as a "drop-in" replacement for an Atomic View that performs better on CPUs by choosing, behind the scenes, between duplication and atomics. However, having the object itself be a public interface makes sense. For example, there are kernels in LAMMPS that use several of these views, which makes the parallel-pattern implementation in #1171 much more difficult. Lifetime is a concern: users do have to explicitly ensure the lifetime of these views is tight around the kernel, but that seems manageable.

ibaned (Contributor) commented Oct 17, 2017

The access operator() should not be both readable and writable. My preference is for write-only.

hcedwar (Contributor) commented Oct 18, 2017

Design cycle to reasonable convergence before progressing with an initial implementation.

ibaned (Contributor) commented Oct 18, 2017

Expanding on that: this was discussed at the Kokkos developers meeting. We agree it should be implemented, but several points of contention exist with the design.

  1. Keeping a copy of the original View and only allocating n-1 copies will save a bit of memory, but will make each access slower (there will be an if statement instead of just index math, and this could hinder compiler optimization). This was mainly brought up by @crtrott , so he and @stanmoore1 should discuss which side of this tradeoff matters most: memory or runtime.

  2. The hybrid case of some duplication and some atomics, while potentially interesting, can be omitted from the scope of the first implementation.

  3. The metaprogramming needed to add an extra dimension is non-trivial. So far we are thinking of restricting the Layout of the internal view to be only LayoutLeft or LayoutRight.

stanmoore1 (Contributor, Author):

@ibaned thanks for the update. Any timeline for implementation? I really need this for my codes, so unless implementation by Kokkos developers is imminent, I'm planning to make a prototype myself.

ibaned (Contributor) commented Oct 26, 2017

@stanmoore1 have you had a chance to chat with @crtrott on the points of contention? I can move forward with either design, just need to know which one...

ibaned (Contributor) commented Oct 26, 2017

That said, it will likely be at least a week, so that prototype may not be a bad idea.

hcedwar (Contributor) commented Nov 1, 2017

This is BLOCKED until an agreement is reached for the design.
@stanmoore1 @crtrott @hcedwar @ibaned

stanmoore1 (Contributor, Author):

@hcedwar are we going to set up a meeting to discuss this? With respect, I'm not a core Kokkos developer, and I really need this for my own codes, so making my own prototype (that I include in my own version of Kokkos) is still a possibility... if we can't agree on something soon.

stanmoore1 (Contributor, Author) commented Nov 1, 2017

Addressing the feedback by @ibaned:

> Expanding on that: this was discussed at the Kokkos developers meeting. We agree it should be implemented, but several points of contention exist with the design.

> Keeping a copy of the original View and only allocating n-1 copies will save a bit of memory, but will make each access slower (there will be an if statement instead of just index math, and this could hinder compiler optimization). This was mainly brought up by @crtrott , so he and @stanmoore1 should discuss which side of this tradeoff matters most: memory or runtime.

I think using n copies instead of n-1 would make the implementation a lot simpler. That does mean the reduction will also be over n copies instead of n-1, which could also reduce runtime performance, but I'm not sure whether that slowdown would be more significant than an if statement in the view access.

> The hybrid case of some duplication and some atomics, while potentially interesting, can be omitted from the scope of the first implementation.

Agreed.

> The metaprogramming needed to add an extra dimension is non-trivial. So far we are thinking of restricting the Layout of the internal view to be only LayoutLeft or LayoutRight.

I'm totally fine with that.

ibaned (Contributor) commented Nov 1, 2017

Just met with @stanmoore1 , below is a summary of some of the things we discussed.

The class name will likely be ReductionView, and will have certain aspects that are controlled by template parameters, beyond the usual parameters to View:

  1. Whether contributions are atomically applied to memory (Atomic, NonAtomic (name up for debate))
  2. Whether (and possibly how) memory is duplicated/enlarged to reduce contention (Duplicated, NonDuplicated (name up for debate))
  3. What the reduction operator is (Sum by default)

These are closely related, and the default cases will be:

  1. Kokkos::OpenMP: Duplicated and NonAtomic
  2. Kokkos::Cuda: NonDuplicated and Atomic
  3. Kokkos::Serial: NonDuplicated and NonAtomic

The ideal interface looks something like this:

// This call does the memory allocation (if any)
// We need to think about names, Kokkos::Sum is already a thing (can it be reused?)
// If not specified, defaults to Sum
// If not specified, the Duplicated versus NonDuplicated is chosen based on ExecSpace
auto reduction_view = Kokkos::create_reduction_view<Sum, Duplicated>(original_view);
// This deep copies into the first "slice" if we use Christian's design
Kokkos::deep_copy(reduction_view, original_view);
Kokkos::parallel_for(niters, KOKKOS_LAMBDA(int i) {
  auto j = some_function(i);
  auto contribution = some_math();
  // this either adds to a duplicated memory location or atomic adds to a non-duplicated one, etc.
  // only operator+= is available, no operator= or read access
  reduction_view(j) += contribution;
});
// So far I'm thinking this deep_copy will also do the reduction across duplicated memory
Kokkos::deep_copy(original_view, reduction_view);

How memory is duplicated remains a point of discussion, I think the main pros and cons are as follows:

  1. A whole new View which has a new dimension of size nthreads. The pro is that the access operator (operator()) will be faster because it's just index arithmetic. The cons are that a real memcpy is required between original_view and reduction_view, and more memory will be used.
  2. Two views, one of which is a shallow pointer to original_view, and the second has a new dimension with size nthreads-1. The pros are that no memcpy is required either before the kernel or after the reduction (the reduction algorithm can store straight into this shallow copy). The cons are that the access operator (operator()) has to choose between the two views with an if statement.

The two-view approach is slower inside the kernel, while the single-view approach is slower due to deep copies between it and the original view. It is currently unclear, and likely kernel-dependent, which of these slowdowns is worse.

I am inclined to begin development with the single-view, nthreads-duplicated, index-only design and start gathering performance data.
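To make the tradeoff concrete, here is a sketch of the two access schemes in plain C++ for a 1-D view of length n with LayoutRight-style indexing (illustrative indexing only, not Kokkos internals):

```cpp
#include <cstddef>

// Scheme 1: one allocation with an extra leading dimension of size nthreads;
// access is pure index arithmetic, no branch.
inline double& access_single(double* dup, std::size_t n, unsigned thread,
                             std::size_t i) {
  return dup[static_cast<std::size_t>(thread) * n + i];
}

// Scheme 2: thread 0 writes straight into the original data; threads
// 1..nthreads-1 use a separate (nthreads-1)-sized duplicate allocation.
// This saves one copy of memory and the memcpy, but the branch sits on
// every access and can hinder compiler optimization.
inline double& access_two_view(double* original, double* dup, std::size_t n,
                               unsigned thread, std::size_t i) {
  return thread == 0
             ? original[i]
             : dup[(static_cast<std::size_t>(thread) - 1) * n + i];
}
```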

stanmoore1 (Contributor, Author):

A third memory-duplication approach would be to have permanently duplicated memory with a new dimension of size nthreads, and take a subview slice to get the original view. That approach has contiguous memory, doesn't require the memcpy, uses less memory than option 1, and doesn't need an if statement in the access operator like option 2. This third option would likely be the fastest at runtime, and a similar method is used in LAMMPS in the native OpenMP code, except that instead of adding a new dimension, one of the existing dimensions is extended by a factor of nthreads. The con is that the duplicated memory is long-lived and cannot be deallocated around the kernel. Integration into an application code could also be more intrusive.
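The third option can be sketched like this (plain C++, names hypothetical): the allocation is permanently nthreads times larger, slice 0 doubles as the "original" view, and a reduce step folds the other slices into it, so no copy-in or access branch is needed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical permanently-duplicated storage: slice t owns elements
// [t*n, (t+1)*n), and slice 0 serves as the original view.
struct ExtendedDup {
  std::vector<double> data;  // permanently sized nthreads * n
  std::size_t n;
  unsigned nthreads;
  ExtendedDup(std::size_t n_, unsigned nt)
      : data(n_ * nt, 0.0), n(n_), nthreads(nt) {}
  double* slice(unsigned t) {
    return data.data() + static_cast<std::size_t>(t) * n;
  }
  // Reduce all slices into slice 0 and return it; no memcpy, no branch
  // in the per-element access, but the extra memory is long-lived.
  double* reduce() {
    for (unsigned t = 1; t < nthreads; ++t)
      for (std::size_t j = 0; j < n; ++j) data[j] += slice(t)[j];
    return slice(0);
  }
};
```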

ibaned (Contributor) commented Nov 2, 2017

> A third memory duplication approach would be to have permanently duplicated memory that has a new dimension of size nthreads and take a subview slice to get the original view.

I see that more as an alternative use case of the nthreads-replicated design. In theory, I just have to provide a subview method of some sort to get the subview, and then it can be used that way.

stanmoore1 (Contributor, Author):

@ibaned that's true.

ibaned (Contributor) commented Nov 2, 2017

I've started drafting an implementation here:
https://github.com/kokkos/kokkos/blob/issue-825/containers/src/Kokkos_ReductionView.hpp

@ibaned ibaned mentioned this issue Nov 10, 2017
ibaned (Contributor) commented Nov 10, 2017

The draft is complete and seems to be performing well.
Pull request #1225 is open with a full implementation.
