
Custom reduction join via lambda argument #99

Closed
hcedwar opened this issue Oct 2, 2015 · 77 comments

@hcedwar
Contributor

hcedwar commented Oct 2, 2015

E.g., parallel_reduce( Policy , base_lambda, join_lambda, return_value )

@hcedwar hcedwar added the Feature Request Create new capability; will potentially require voting label Oct 2, 2015
@crtrott
Member

crtrott commented Oct 2, 2015

Can you give us an example where a scan operation which doesn't do += is useful?
So far we haven't had a single one, and we don't want to implement and maintain something if it's not practically useful.

@dholladay00

Agreed.

My main issue was just finding examples/documentation of the parallel scan pattern with a lambda argument.

@mrtupek

mrtupek commented Oct 8, 2015

This feature would be of interest to us, as it would allow us to do a min reduce without having to create a functor.

@crtrott
Member

crtrott commented Oct 31, 2015

Question: in order to provide an initial value, would you rather give a third lambda, or should we use the "in-value" of the return argument as the initial value? For example, for a multiplication:

double total_product;
parallel_reduce(N,KOKKOS_LAMBDA (const int i, double& product) {
  product*=i;
}, 
[] (volatile double& value, const volatile double& update) { value *= update;},
[] (double& value) { value = 1.0; },
total_product);

Or:

double total_product = 1.0;
parallel_reduce(N,KOKKOS_LAMBDA (const int i, double& product) {
  product*=i;
}, 
[] (volatile double& value, const volatile double& update) { value *= update;},
total_product);

@jeffamelang

I would prefer the second version. For the first version (and the currently existing methods), the initial value of total_product is silently ignored. Users who are familiar with OpenMP expect this initial value to be used. For example, my students (to whom I'm just starting to introduce Kokkos, right after OpenMP and TBB) were aghast to find that Kokkos ignored that value. I see no benefit to ignoring the value, and it's more error-prone (in my opinion).
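
For reference, a minimal OpenMP sketch of the semantics being described, where the reduction variable's original value participates in the final result (compile with -fopenmp):

#include <cstdio>

int main() {
  const int N = 1000;
  double total = 100.0;  // OpenMP combines this initial value into the result
  #pragma omp parallel for reduction(+:total)
  for (int i = 0; i < N; i++) {
    total += 1.0;
  }
  printf("%f\n", total);  // prints 1100.0, i.e. 100.0 + N, not just N
  return 0;
}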

@nmhamster
Contributor

I agree with Jeff that the second one seems the more logical approach. It's likely that initialization of the value will happen earlier in the function anyway, since most programmers are attuned to that.

@nmhamster
Contributor

Have you looked into how this may mesh with user-defined reductions in OpenMP? If we can get alignment there, it might give us a better level of optimization. The use of volatiles everywhere is likely to mean limited vectorization.

@hcedwar
Contributor Author

hcedwar commented Nov 23, 2015

An option for a custom reduction not defined in the functor:
parallel_reduce( policy , functor , customizer );
customizer = reduce( return_reference , join_lambda , init_lambda );

@nmhamster
Contributor

I know I'm asking for the world on a stick, but what I'd /really/ love is:

my_variable = parallel_reduce(policy, functor, join_lambda, init_lambda)

@hcedwar hcedwar added this to the GTC 2016 milestone Nov 23, 2015
@hcedwar
Contributor Author

hcedwar commented Nov 23, 2015

No, you want:
Future = parallel_reduce( ... );


@nmhamster
Contributor

Yes, but I figured one step at a time. That's what I'd really like, please.

@nmhamster
Contributor

Oops.. clicked the wrong button. Sorry. When this is in the C++XX standard, is this going to work with std::async?

@nmhamster nmhamster reopened this Nov 23, 2015
@crtrott
Member

crtrott commented Jan 7, 2016

Another idea is to follow an Intel proposal with something like this:

Kokkos::parallel_reduce(N,[=](const int &i, double& thread_local_result)
  {...BODY...},
  Kokkos::reduction(result_place,init_value,OperationType())
);

Effectively, reduction would return a struct which knows where to put the final result (result_place, which could be a reference to a double, a double*, or a Kokkos::View of a double), how to initialize thread-local contributions, and what the combination operation is.
That OperationType could be something like std::plus, but we could also allow a join lambda to go there, and we could predefine some operations in Kokkos: Kokkos::[plus/min/max/minus].
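
A minimal sketch of what such a reduction() helper could return; the names here are hypothetical, not a committed API:

#include <functional>  // std::plus

// Hypothetical struct produced by Kokkos::reduction(...): it carries the
// destination of the final result, the initial value for thread-local
// contributions, and the combination operation.
template <class Scalar, class Op>
struct ReductionSpec {
  Scalar& result_place;  // could also be a Scalar* or a Kokkos::View
  Scalar  init_value;    // initializes each thread-local contribution
  Op      op;            // e.g. std::plus<Scalar>, or a join lambda

  ReductionSpec(Scalar& place, const Scalar& init, const Op& o)
    : result_place(place), init_value(init), op(o) {}
};

// Factory mirroring the proposed Kokkos::reduction(result_place, init_value, op).
template <class Scalar, class Op>
ReductionSpec<Scalar, Op> reduction(Scalar& place, const Scalar& init, const Op& op) {
  return ReductionSpec<Scalar, Op>(place, init, op);
}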

@hcedwar
Contributor Author

hcedwar commented Jan 11, 2016

Based upon a recent telecon with NVIDIA, allowing lambda declarations within the "reduction" clause is extremely problematic. For portability, the "reduction" clause needs to be a functor that defines the join, init, and result processing/placement. We can provide helpers that would cover the "90-99%" use cases (assuming += is the 90% use case), namely min/max.

@hcedwar hcedwar self-assigned this Jan 11, 2016
@nmhamster
Contributor

Can we have two variants?
(1) Future my_variable = parallel_reduce(policy, functor, join_lambda, init_lambda)
(2) Future my_variable = parallel_reduce(policy, functor, join_lambda, literal value of T)

@hcedwar
Contributor Author

hcedwar commented Jan 11, 2016

For NVIDIA portability we can't do: parallel_reduce( policy , work_closure , join_lambda , init_lambda ).
We have to do instead: parallel_reduce( policy , work_closure , reduction_functor ).
We can provide a small suite of helper "reduction_functor"s to make the "90-99%" use cases beyond the most common { += , 0 } easy.

@ndellingwood ndellingwood assigned gmackey and unassigned hcedwar Mar 16, 2016
@ndellingwood ndellingwood modified the milestones: Spring 2016, GTC 2016 Mar 16, 2016
@hcedwar hcedwar assigned crtrott and unassigned gmackey Apr 27, 2016
@hcedwar
Contributor Author

hcedwar commented Apr 27, 2016

The interface for a reduction specialization is to move the current specialization code into a separate class.

struct MyReductionSpecialization {
  typedef ... value_type ;

  // Combine a partial result into the destination value.
  KOKKOS_INLINE_FUNCTION
  void join( volatile value_type & , const volatile value_type & ) const ;

  // Initialize a value to the identity of the reduction operation.
  KOKKOS_INLINE_FUNCTION
  void init( value_type & ) const ;

  // Optional: post-process the final result.
  KOKKOS_INLINE_FUNCTION
  void final( value_type & ) const ;
};

Then provide helpers for the common reduction operations

Kokkos::Sum<T>
Kokkos::Max<T>
Kokkos::Min<T>

The parallel_reduce for a maximum becomes

Kokkos::parallel_reduce( policy , lambda , Kokkos::Max<double>() , result );
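
For illustration, a user-written reducer against this interface might look like the following sketch (MaxDouble is a hypothetical name; -DBL_MAX serves as the identity for a max over doubles):

#include <cfloat>  // DBL_MAX

struct MaxDouble {
  typedef double value_type ;

  // Combine a partial result into the destination.
  KOKKOS_INLINE_FUNCTION
  void join( volatile value_type & dest , const volatile value_type & src ) const {
    if ( src > dest ) dest = src ;
  }

  // Identity element for max.
  KOKKOS_INLINE_FUNCTION
  void init( value_type & val ) const { val = -DBL_MAX ; }
};

// Used with the proposed four-argument form:
// Kokkos::parallel_reduce( policy , lambda , MaxDouble() , result );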

@crtrott
Member

crtrott commented May 26, 2016

After reviewing the current implementation, we decided to make this part of an overhaul of the internals of parallel_reduce, which is getting really out of hand in complexity with its many overloads. This is therefore getting postponed.

@nmhamster
Contributor

I don't believe that the POWER8 (v2.07 ISA) has horizontal instructions, but I'm surprised the XL compiler cannot vectorize this code without some nicety around a shuffle to perform the final collapse. Assuming no downside for performance, it would be better to have simple reductions cast as above, so that where there is compiler assistance for handling reductions (as in Intel and, I think, newer GCC) we gain the benefit.

@nmhamster
Contributor

By the way, my AVX512 test is a near-identical code I wrote, but it reads N from argc to avoid anything being optimized out. I think the results will be the same.

@crtrott
Member

crtrott commented Jun 14, 2016

The AVX512 code with Intel 17 vectorizes the Kokkos variant as well.

@nmhamster
Contributor

GCC 4.7.4 will compile and vectorize this provided -ffast-math is included in the compile flags.

$ g++ -fopenmp -O3 -mavx2 -o reduce.gnu reduce.c -ffast-math

Produces compiler vectorization output of:

Analyzing loop at reduce.c:14

Analyzing loop at reduce.c:16

16: vect_model_induction_cost: inside_cost = 1, outside_cost = 2 .
16: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
16: vect_model_reduction_cost: inside_cost = 2, outside_cost = 3 .
16: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
16: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
16: Cost model analysis:
  Vector inside of loop cost: 4
  Vector outside of loop cost: 22
  Scalar iteration cost: 3
  Scalar outside cost: 6
  prologue iterations: 0
  epilogue iterations: 4
  Calculated minimum iters for profitability: 6

16:   Profitability threshold = 7


Vectorizing loop at reduce.c:16

16: Profitability threshold is 7 loop iterations.
16: LOOP VECTORIZED.
reduce.c:14: note: vectorized 1 loops in function.

@nmhamster
Contributor

I think we will need to talk to IBM about why XL does not want to vectorize the code example and see what the reason is. It may just be a cost-model thing in the end, although that's clearly not what the optimization report provides as feedback.

@nmhamster
Contributor

So are we agreed that simple reductions should use #pragma omp parallel for reduction(..) as the construct, rather than atomics etc.?

@nmhamster
Contributor

OK, let's put it this way: only when we are happy to bypass determinism; otherwise we would resort to a local collapse and then a deterministic tree reduce between threads?

@crtrott
Member

crtrott commented Jun 14, 2016

I don't think that we agreed on that yet :-). So far everything is using a deterministic tree reduce.
The problem with the omp reduction is that I am not sure it supports C++ data types the same way we do right now. So if we want to do it, we probably have to add another enable_if layer in the OpenMP backend to switch between tree reductions and calling that. It's certainly possible, though. What are the restrictions on using an omp parallel for reduction(+:value) for value? Any idea?

@hcedwar
Contributor Author

hcedwar commented Jun 14, 2016

We have deterministic reductions on all statically scheduled execution policies on all backends, by scheduling ranges to threads and performing a deterministic inter-thread reduce. I don't see a need to add the complexity of dispatching to the built-in OpenMP reduce for trivial types and our own for non-trivial types, unless it is known to perform better than our own reduce.
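
A sketch of the deterministic inter-thread combine being described, ignoring barriers and other synchronization; here partial, nthreads, and join are assumed to be the per-thread partial results, the thread count, and the combine operation:

// partial[t] holds thread t's result for its statically scheduled range.
// The partials are combined pairwise with a fixed stride pattern, so the
// order (and hence the floating-point result) is identical on every run.
for (int stride = 1; stride < nthreads; stride *= 2) {
  for (int t = 0; t + stride < nthreads; t += 2 * stride) {
    join(partial[t], partial[t + stride]);
  }
}
// partial[0] now holds the final, deterministic result.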

@crtrott
Member

crtrott commented Jun 14, 2016

Also, I'd like to see some performance analysis showing the benefit of switching to that.

@nmhamster
Contributor

OK, I know what the issue is with IBM XL. The problem is that the summation variable is of type double but the code adds values of type int (my test code did the same). Because of the cast (which is not vectorizable on POWER) the loop cannot be vectorized. If I change the code to the below, I get SIMD instruction generation.

  #pragma omp parallel for reduction(+:value)
  for(int i=0; i<N; i++) {
    value += 1.05;
  }

Compiler report for the code sequence you pasted in the comments:

1586-542 (I) Loop (loop index 1 with nest-level 0 and iteration count 100) at reduce.cpp <line 15> was SIMD vectorized.
1586-543 (I) <SIMD info> Total number of the innermost loops considered <"1">. Total number of the innermost loops SIMD vectorized <"1">.

@nmhamster
Contributor

You cannot use the reduction clause for generic C++ types unless you move to the OpenMP 4 user-defined reductions, if I recall correctly (support is there, but not without some work). In general the thinking is that the compound reduction should be more easily optimized. Agreed that we should do some work to ensure this is producing faster code.
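
For reference, an OpenMP 4.0 user-defined reduction looks roughly like this; MinLoc and data are hypothetical, and spelling out the combiner and initializer by hand is the "some work" mentioned above:

struct MinLoc { double val; int idx; };

// Declare a custom "minloc" reduction over MinLoc: the combiner merges two
// partials, the initializer sets each thread-private copy to the identity.
#pragma omp declare reduction(minloc : MinLoc : \
    omp_out = (omp_in.val < omp_out.val ? omp_in : omp_out)) \
  initializer(omp_priv = MinLoc{1.0e300, -1})

MinLoc best = {1.0e300, -1};
#pragma omp parallel for reduction(minloc : best)
for (int i = 0; i < N; i++) {
  if (data[i] < best.val) best = MinLoc{data[i], i};  // data: assumed input array
}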

@crtrott
Member

crtrott commented Jun 14, 2016

I just ran this on KNL. The raw OpenMP as posted above is about 5% slower (consistently) than the Kokkos reduction, using Intel 17. I.e.:

double value = 0;
Kokkos::Impl::Timer timer;
Kokkos::parallel_reduce(N,KOKKOS_LAMBDA (const int& i, double& lsum) {
  lsum+=i;
},value);
double time1 = timer.seconds();
value = 0;
timer.reset();
#pragma omp parallel for reduction(+:value)
for(int i=0; i<N; i++)
  value+=i; 
double time2 = timer.seconds();

I used N = 1000000000 and the times were 5.2e-3 s vs 5.5e-3 s; the variance in runtime was smaller than the difference between the two times (with the Kokkos variant having less variance). This is with 256 threads and KMP_AFFINITY=compact.

@nmhamster
Contributor

Sorry, I can't parse your sentence: the raw OpenMP is 5% slower than Kokkos?

@crtrott
Member

crtrott commented Jun 14, 2016

Yes, I will collect some histogram data now.

@nmhamster
Contributor

Can you try setting N to be much smaller and then run repeated reductions?

@dsunder
Contributor

dsunder commented Jun 14, 2016

Just to add noise to this thread, a few months ago we were using OpenMP reductions in Uintah and we saw about a 5% improvement when we switched to using Kokkos reductions with dynamic scheduling (static scheduling was slightly slower than OpenMP).

--Dan


@crtrott
Member

crtrott commented Jun 14, 2016

OK I will send some data to you via email for KNL (or is the NDA lifted now?).

@hcedwar
Contributor Author

hcedwar commented Jun 14, 2016

So using the OpenMP native reduction would complicate the code and in special circumstances yield a negligible performance gain. Until there is evidence to the contrary, the native OpenMP reduction option is mothballed.

@nmhamster
Contributor

NDA is still in place until further notice.

@mhoemmen
Contributor

mhoemmen commented Jun 14, 2016

@dsunder @nmhamster @crtrott Dan might count as a Sandian. I'll e-mail you with details.

@dsunder
Contributor

dsunder commented Jun 14, 2016

Only to my Sandia email please :)

--Dan


@crtrott
Member

crtrott commented Jun 15, 2016

Btw. I just ran into another ugly issue. The return argument must be taken by reference if it is a scalar, but must be taken by value if it is a View or Reducer type, in order to support inline construction.
But that means a whole lot more enable_if logic, because I need to distinguish between those things early on. That, btw., also excludes an implementation with variadic templates at the first level. Doing this brings the compile time up by about 6% or so, which means it now takes longer than the old implementation. Btw., the old implementation did not support inline construction of a view, i.e. something like:

parallel_reduce(N,lambda,Kokkos::View<double*,Kokkos::MemoryUnmanaged>(data,K));

where data is some ptr to place the result values in.

@nmhamster
Contributor

So are we about flat for compile time, is it 6% worse than the original, or is this an increase of 6% leaving us somewhere under 6% worse than the original?

@crtrott
Member

crtrott commented Jun 15, 2016

We are now worse than the original, with the main extra cost coming from supporting inline construction for the reducer or a return view. The problem is that this forces me to add more template enable_ifs to the entry functions to distinguish between arguments which can and must be taken by reference and arguments which can (and must) be taken by value. I.e., all the entry functions now need:

std::enable_if< !is_view<ReturnArgument>::value && !is_reducer<ReturnArgument>::value >::type* = 0

or

std::enable_if< is_view<ReturnArgument>::value || is_reducer<ReturnArgument>::value >::type* = 0

This in particular means adding enable_if logic to functions which didn't need it before, and I believe as a consequence it takes much longer to figure out which overload to take.
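
Concretely, the dispatch described above looks something like this sketch; signatures are simplified, and the is_view / is_reducer templates stand in for the actual Kokkos traits:

#include <type_traits>

// Stand-ins for the real Kokkos type traits used in the dispatch.
template <class T> struct is_view    : std::false_type {};
template <class T> struct is_reducer : std::false_type {};

// Overload for plain scalar results: must bind by non-const reference.
template <class Policy, class Functor, class ReturnArgument>
void parallel_reduce( const Policy & policy , const Functor & functor ,
                      ReturnArgument & result ,
                      typename std::enable_if<
                        !is_view<ReturnArgument>::value &&
                        !is_reducer<ReturnArgument>::value >::type * = 0 );

// Overload for Views and Reducers: taken by const reference / value so that
// temporaries, i.e. inline construction, are accepted.
template <class Policy, class Functor, class ReturnArgument>
void parallel_reduce( const Policy & policy , const Functor & functor ,
                      const ReturnArgument & result ,
                      typename std::enable_if<
                        is_view<ReturnArgument>::value ||
                        is_reducer<ReturnArgument>::value >::type * = 0 );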

@crtrott
Member

crtrott commented Jun 15, 2016

I should add that just supporting reducers got us back to where we started. I.e., I first improved compile times by about 2%, then I lost about 3% to support reducers, and then I lost another 4-5% to support inline construction.

@crtrott
Member

crtrott commented Jul 1, 2016

This is now in develop. There are no Kokkos-provided reducers yet, but you can write your own.
