
Custom reduction join via lambda argument #99

Closed
hcedwar opened this issue Oct 2, 2015 · 77 comments

@hcedwar
Contributor

hcedwar commented Oct 2, 2015

E.g., parallel_reduce( Policy , base_lambda, join_lambda, return_value )

@hcedwar hcedwar added the Feature Request Create new capability; will potentially require voting label Oct 2, 2015
@crtrott
Member

crtrott commented Oct 2, 2015

Can you give us an example where a scan operation which doesn't do += is useful?
So far we haven't had a single one, and we don't want to implement and maintain something if it's not practically useful.

@dholladay00

Agreed.

My main issue was just finding examples/documentation of the parallel scan pattern with a lambda argument.

@mrtupek

mrtupek commented Oct 8, 2015

This feature would be of interest to us, as it would allow us to do a min reduce without having to create a functor.

@crtrott
Member

crtrott commented Oct 31, 2015

Question: in order to provide an initial value, would you rather give a third lambda, or should we use the "in-value" of the return argument as the initial value? For example, for a multiplication:

double total_product;
parallel_reduce(N,KOKKOS_LAMBDA (const int i, double& product) {
  product*=i;
}, 
[] (volatile double& value, const volatile double& update) { value *= update;},
[] (double& value) { value = 1.0; },
total_product);

Or:

double total_product = 1.0;
parallel_reduce(N,KOKKOS_LAMBDA (const int i, double& product) {
  product*=i;
}, 
[] (volatile double& value, const volatile double& update) { value *= update;},
total_product);

@jeffamelang

I would prefer the second version. For the first version (and the currently existing methods), the initial value of total_product is silently ignored. Users who are familiar with OpenMP expect this initial value to be used. For example, my students (to whom I'm just starting to introduce Kokkos, right after OpenMP and TBB) were aghast to find that Kokkos ignored that value. I see no benefit to ignoring the value, and it's more error-prone (in my opinion).
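
For reference, a minimal OpenMP sketch of the semantics being described, where the reduction variable's original value participates in the final result (compile with -fopenmp):

#include <cstdio>

int main() {
  const int N = 1000;
  double total = 100.0;  // OpenMP combines this initial value into the result
  #pragma omp parallel for reduction(+:total)
  for (int i = 0; i < N; i++) {
    total += 1.0;
  }
  printf("%f\n", total);  // prints 1100.0, i.e. 100.0 + N, not just N
  return 0;
}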

@nmhamster
Contributor

I agree with Jeff that the second one seems the more logical approach. It's likely that initialization of the value will happen earlier in the function anyway, since most programmers are attuned to that.

@nmhamster
Contributor

Have you looked into how this may mesh with user-defined reductions in OpenMP? If we can get alignment there, it might give us a better level of optimization. The use of volatiles everywhere is likely to mean limited vectorization.

@hcedwar
Contributor Author

hcedwar commented Nov 23, 2015

An option for a custom reduction not defined in the functor:
parallel_reduce( policy , functor , customizer );
customizer = reduce( return_reference , join_lambda , init_lambda );

@nmhamster
Contributor

I know I'm asking for the world on a stick, but what I'd /really/ love is:

my_variable = parallel_reduce(policy, functor, join_lambda, init_lambda)

@hcedwar hcedwar added this to the GTC 2016 milestone Nov 23, 2015
@hcedwar
Contributor Author

hcedwar commented Nov 23, 2015

No, you want:
Future = parallel_reduce( ... );


@nmhamster
Contributor

Yes, but I figured one step at a time. That's what I'd really like, please.

@nmhamster
Contributor

Oops.. clicked the wrong button. Sorry. When this is in the C++XX standard, is this going to work with std::async?

@nmhamster nmhamster reopened this Nov 23, 2015
@crtrott
Member

crtrott commented Jan 7, 2016

Another idea is to follow an Intel proposal with something like this:

Kokkos::parallel_reduce(N,[=](const int &i, double& thread_local_result)
  {...BODY...},
  Kokkos::reduction(result_place,init_value,OperationType())
);

Effectively, reduction would return a struct which knows where to put the final result (result_place, which could be a reference to a double, a double*, or a Kokkos::View of a double), how to initialize thread-local contributions, and what the combination operation is.
That OperationType could be something like std::plus, but we could also allow a join lambda to go there, and we could predefine some operations in Kokkos: Kokkos::[plus/min/max/minus].
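
A minimal sketch of what such a reduction() helper could return; the names here are hypothetical, not a committed API:

#include <functional>  // std::plus

// Hypothetical struct produced by Kokkos::reduction(...): it carries the
// destination of the final result, the initial value for thread-local
// contributions, and the combination operation.
template <class Scalar, class Op>
struct ReductionSpec {
  Scalar& result_place;  // could also be a Scalar* or a Kokkos::View
  Scalar  init_value;    // initializes each thread-local contribution
  Op      op;            // e.g. std::plus<Scalar>, or a join lambda

  ReductionSpec(Scalar& place, const Scalar& init, const Op& o)
    : result_place(place), init_value(init), op(o) {}
};

// Factory mirroring the proposed Kokkos::reduction(result_place, init_value, op).
template <class Scalar, class Op>
ReductionSpec<Scalar, Op> reduction(Scalar& place, const Scalar& init, const Op& op) {
  return ReductionSpec<Scalar, Op>(place, init, op);
}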

@hcedwar
Contributor Author

hcedwar commented Jan 11, 2016

Based upon a recent telecon with NVIDIA, allowing lambda declarations within the "reduction" clause is extremely problematic. For portability, the "reduction" clause needs to be a functor that defines the join, init, and result processing/placement. We can provide helpers that would cover the "90-99%" use cases (assuming += is the 90% use case), namely min/max.

@hcedwar hcedwar self-assigned this Jan 11, 2016
@nmhamster
Contributor

Can we have two variants?
(1) Future my_variable = parallel_reduce(policy, functor, join_lambda, init_lambda)
(2) Future my_variable = parallel_reduce(policy, functor, join_lambda, literal value of T)

@hcedwar
Contributor Author

hcedwar commented Jan 11, 2016

For NVIDIA portability we can't do: parallel_reduce( policy , work_closure , join_lambda , init_lambda ).
We have to do instead: parallel_reduce( policy , work_closure , reduction_functor ).
We can provide a small suite of helper "reduction_functor"s to make the "90-99%" use cases beyond the most common { += , 0 } easy.

@ndellingwood ndellingwood assigned gmackey and unassigned hcedwar Mar 16, 2016
@ndellingwood ndellingwood modified the milestones: Spring 2016, GTC 2016 Mar 16, 2016
@hcedwar hcedwar assigned crtrott and unassigned gmackey Apr 27, 2016
@hcedwar
Contributor Author

hcedwar commented Apr 27, 2016

The interface for a reduction specialization is to move the current specialization code into a separate class.

struct MyReductionSpecialization {
  typedef ... value_type ;

  // Combine a partial result into the destination value.
  KOKKOS_INLINE_FUNCTION
  void join( volatile value_type & , const volatile value_type & ) const ;

  // Initialize a value to the identity of the reduction operation.
  KOKKOS_INLINE_FUNCTION
  void init( value_type & ) const ;

  // Optional: post-process the final result.
  KOKKOS_INLINE_FUNCTION
  void final( value_type & ) const ;
};

Then provide helpers for the common reduction operations

Kokkos::Sum<T>
Kokkos::Max<T>
Kokkos::Min<T>

The parallel_reduce for a maximum becomes

Kokkos::parallel_reduce( policy , lambda , Kokkos::Max<double>() , result );
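
For illustration, a user-written reducer against this interface might look like the following sketch (MaxDouble is a hypothetical name; -DBL_MAX serves as the identity for a max over doubles):

#include <cfloat>  // DBL_MAX

struct MaxDouble {
  typedef double value_type ;

  // Combine a partial result into the destination.
  KOKKOS_INLINE_FUNCTION
  void join( volatile value_type & dest , const volatile value_type & src ) const {
    if ( src > dest ) dest = src ;
  }

  // Identity element for max.
  KOKKOS_INLINE_FUNCTION
  void init( value_type & val ) const { val = -DBL_MAX ; }
};

// Used with the proposed four-argument form:
// Kokkos::parallel_reduce( policy , lambda , MaxDouble() , result );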

@crtrott
Member

crtrott commented May 26, 2016

After reviewing the current implementation, we decided to make this part of an overhaul of the internals of parallel_reduce, which is getting really out of hand in complexity with its many overloads. This is therefore getting postponed.

@nmhamster
Contributor

I don't believe that the POWER8 (v2.07 ISA) has horizontal instructions, but I'm surprised the XL compiler cannot vectorize this code without some nicety around a shuffle to perform the final collapse. Assuming no downside for performance, it would be better to have simple reductions cast as above, so that where there is compiler assistance for handling reductions (as in Intel and, I think, newer GCC) we gain the benefit.

@nmhamster
Contributor

By the way, my AVX512 test is a near-identical code I wrote, but it reads N from argc to avoid anything being optimized out. I think the results will be the same.

@crtrott
Member

crtrott commented Jun 14, 2016

The AVX512 code with Intel 17 vectorizes the Kokkos variant as well.

@nmhamster
Contributor

GCC 4.7.4 will compile and vectorize this provided -ffast-math is included in the compile flags.

$ g++ -fopenmp -O3 -mavx2 -o reduce.gnu reduce.c -ffast-math

Produces compiler vectorization output of:

Analyzing loop at reduce.c:14

Analyzing loop at reduce.c:16

16: vect_model_induction_cost: inside_cost = 1, outside_cost = 2 .
16: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
16: vect_model_reduction_cost: inside_cost = 2, outside_cost = 3 .
16: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
16: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
16: Cost model analysis:
  Vector inside of loop cost: 4
  Vector outside of loop cost: 22
  Scalar iteration cost: 3
  Scalar outside cost: 6
  prologue iterations: 0
  epilogue iterations: 4
  Calculated minimum iters for profitability: 6

16:   Profitability threshold = 7


Vectorizing loop at reduce.c:16

16: Profitability threshold is 7 loop iterations.
16: LOOP VECTORIZED.
reduce.c:14: note: vectorized 1 loops in function.

@nmhamster
Contributor

I think we will need to talk to IBM about why XL does not want to vectorize the code example and see what the reason is. It may just be a cost-model thing in the end, although that's clearly not what the optimization report provides as feedback.

@nmhamster
Contributor

So are we agreed that simple reductions should use #pragma omp parallel for reduction(..) as the construct, rather than atomics etc.?

@nmhamster
Contributor

OK, let's put it this way: only when we are happy to bypass determinism; otherwise we would resort to a local collapse and then a deterministic tree reduce between threads?

@crtrott
Member

crtrott commented Jun 14, 2016

I don't think that we agreed on that yet :-). So far everything is using a deterministic tree reduce.
The problem with the omp reduction is that I am not sure it supports C++ data types the same way we do right now. So if we want to do it, we probably have to add another enable_if layer in the OpenMP backend to switch between tree reductions and calling that. It's certainly possible, though. What are the restrictions on using an omp parallel for reduction(+:value) for value? Any idea?

@hcedwar
Contributor Author

hcedwar commented Jun 14, 2016

We have deterministic reductions on all statically scheduled execution policies on all backends, by scheduling ranges to threads and performing a deterministic inter-thread reduce. I don't see a need to add the complexity of dispatching to the built-in OpenMP reduce for trivial types and our own for non-trivial types, unless it is known to perform better than our own reduce.
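
A sketch of the deterministic inter-thread combine being described, ignoring barriers and other synchronization; here partial, nthreads, and join are assumed to be the per-thread partial results, the thread count, and the combine operation:

// partial[t] holds thread t's result for its statically scheduled range.
// The partials are combined pairwise with a fixed stride pattern, so the
// order (and hence the floating-point result) is identical on every run.
for (int stride = 1; stride < nthreads; stride *= 2) {
  for (int t = 0; t + stride < nthreads; t += 2 * stride) {
    join(partial[t], partial[t + stride]);
  }
}
// partial[0] now holds the final, deterministic result.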

@crtrott
Member

crtrott commented Jun 14, 2016

Also, I'd like to see some performance analysis showing the benefit of switching to that.

@nmhamster
Contributor

OK, I know what the issue is with IBM XL. The problem is that the summation variable is of type double but the code adds values of type int (my test code did the same). Because of the cast (which is not vectorizable on POWER) the loop cannot be vectorized. If I change the code to the below, I get SIMD instruction generation.

  #pragma omp parallel for reduction(+:value)
  for(int i=0; i<N; i++) {
    value += 1.05;
  }

Compiler report for the code sequence you pasted in the comments:

1586-542 (I) Loop (loop index 1 with nest-level 0 and iteration count 100) at reduce.cpp <line 15> was SIMD vectorized.
1586-543 (I) <SIMD info> Total number of the innermost loops considered <"1">. Total number of the innermost loops SIMD vectorized <"1">.

@nmhamster
Contributor

You cannot use the reduction clause for generic C++ types unless you move to the OpenMP 4 user-defined reductions, if I recall correctly (support is there, but not without some work). In general the thinking is that the compound reduction should be more easily optimized. Agreed that we should do some work to ensure this is producing faster code.
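
For reference, an OpenMP 4.0 user-defined reduction looks roughly like this; MinLoc and data are hypothetical, and spelling out the combiner and initializer by hand is the "some work" mentioned above:

struct MinLoc { double val; int idx; };

// Declare a custom "minloc" reduction over MinLoc: the combiner merges two
// partials, the initializer sets each thread-private copy to the identity.
#pragma omp declare reduction(minloc : MinLoc : \
    omp_out = (omp_in.val < omp_out.val ? omp_in : omp_out)) \
  initializer(omp_priv = MinLoc{1.0e300, -1})

MinLoc best = {1.0e300, -1};
#pragma omp parallel for reduction(minloc : best)
for (int i = 0; i < N; i++) {
  if (data[i] < best.val) best = MinLoc{data[i], i};  // data: assumed input array
}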

@crtrott
Member

crtrott commented Jun 14, 2016

I just ran this on KNL. The raw OpenMP as posted above is about 5% slower (consistently) than the Kokkos reduction, using Intel 17. I.e.:

double value = 0;
Kokkos::Impl::Timer timer;
Kokkos::parallel_reduce(N,KOKKOS_LAMBDA (const int& i, double& lsum) {
  lsum+=i;
},value);
double time1 = timer.seconds();
value = 0;
timer.reset();
#pragma omp parallel for reduction(+:value)
for(int i=0; i<N; i++)
  value+=i; 
double time2 = timer.seconds();

I used N = 1000000000 and the times were 5.2e-3 s vs 5.5e-3 s; the variance in runtime was smaller than the difference between the two times (with the Kokkos variant having less variance). This is with 256 threads and KMP_AFFINITY=compact.

@nmhamster
Contributor

Sorry, I can't parse your sentence: the raw OpenMP is 5% slower than Kokkos?

@crtrott
Member

crtrott commented Jun 14, 2016

Yes, I will collect some histogram data now.

@nmhamster
Contributor

Can you try setting N to be much smaller and then run repeated reductions?

@dsunder
Contributor

dsunder commented Jun 14, 2016

Just to add noise to this thread, a few months ago we were using OpenMP reductions in Uintah and we saw about a 5% improvement when we switched to using Kokkos reductions with dynamic scheduling (static scheduling was slightly slower than OpenMP).

--Dan


@crtrott
Member

crtrott commented Jun 14, 2016

OK I will send some data to you via email for KNL (or is the NDA lifted now?).

@hcedwar
Contributor Author

hcedwar commented Jun 14, 2016

So using the OpenMP native reduction would complicate the code and in special circumstances yield a negligible performance gain. Until there is evidence to the contrary, the native OpenMP reduction option is mothballed.

@nmhamster
Contributor

NDA is still in place until further notice.

@mhoemmen
Contributor

mhoemmen commented Jun 14, 2016

@dsunder @nmhamster @crtrott Dan might count as a Sandian. I'll e-mail you with details.

@dsunder
Contributor

dsunder commented Jun 14, 2016

Only to my Sandia email please :)

--Dan


@crtrott
Member

crtrott commented Jun 15, 2016

Btw. I just ran into another ugly issue. The return argument must be taken by reference if it is a scalar, but must be taken by value if it is a View or Reducer type, in order to support inline construction.
But that means a whole lot more enable_if logic, because I need to distinguish between those things early on. That, btw., also excludes an implementation with variadic templates at the first level. Doing this brings the compile time up by about 6% or so, which means it now takes longer than the old implementation. Btw., the old implementation did not support inline construction of a view, i.e. something like:

parallel_reduce(N,lambda,Kokkos::View<double*,Kokkos::MemoryUnmanaged>(data,K));

where data is some ptr to place the result values in.

@nmhamster
Contributor

So are we about flat for compile time, is it 6% worse than the original, or is this an increase of 6% leaving us somewhere under 6% worse than the original?

@crtrott
Member

crtrott commented Jun 15, 2016

We are now worse than the original, with the main extra cost coming from supporting inline construction for the reducer or a return view. The problem is that this forces me to add more template enable_ifs to the entry functions to distinguish between arguments which can and must be taken by reference and arguments which can (and must) be taken by value. I.e., all the entry functions now need:

std::enable_if< !is_view<ReturnArgument>::value && !is_reducer<ReturnArgument>::value >::type* = 0

or

std::enable_if< is_view<ReturnArgument>::value || is_reducer<ReturnArgument>::value >::type* = 0

This in particular means adding enable_if logic to functions which didn't need it before, and I believe as a consequence it takes much longer to figure out which overload to take.
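
Concretely, the dispatch described above looks something like this sketch; signatures are simplified, and the is_view / is_reducer templates stand in for the actual Kokkos traits:

#include <type_traits>

// Stand-ins for the real Kokkos type traits used in the dispatch.
template <class T> struct is_view    : std::false_type {};
template <class T> struct is_reducer : std::false_type {};

// Overload for plain scalar results: must bind by non-const reference.
template <class Policy, class Functor, class ReturnArgument>
void parallel_reduce( const Policy & policy , const Functor & functor ,
                      ReturnArgument & result ,
                      typename std::enable_if<
                        !is_view<ReturnArgument>::value &&
                        !is_reducer<ReturnArgument>::value >::type * = 0 );

// Overload for Views and Reducers: taken by const reference / value so that
// temporaries, i.e. inline construction, are accepted.
template <class Policy, class Functor, class ReturnArgument>
void parallel_reduce( const Policy & policy , const Functor & functor ,
                      const ReturnArgument & result ,
                      typename std::enable_if<
                        is_view<ReturnArgument>::value ||
                        is_reducer<ReturnArgument>::value >::type * = 0 );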

@crtrott
Member

crtrott commented Jun 15, 2016

I should add that just supporting reducers got us back to where we started. I.e., I first improved compile times by about 2%, then I lost about 3% to support reducers, and then I lost another 4-5% to support inline construction.

@crtrott
Member

crtrott commented Jul 1, 2016

This is now in develop. There are no Kokkos-provided reducers yet, but you can write your own.
