Custom reduction join via lambda argument #99
Can you give us an example where a scan operation which doesn't do += is useful? |
Agreed. My main issue was just finding examples/documentation of the parallel scan pattern with a lambda argument. |
This feature would be of interest to us, as it would allow us to do a min reduce without having to create a functor. |
Question: in order to provide an initial value, would you rather give a third lambda, or should we use the "in-value" of the return argument as the initial value? For example, for a multiplication:

```c++
double total_product;
parallel_reduce(N, KOKKOS_LAMBDA (const int i, double& product) {
    product *= i;
  },
  [] (volatile double& value, const volatile double& update) { value *= update; },
  [] (double& value) { value = 1.0; },
  total_product);
```

Or:

```c++
double total_product = 1.0;
parallel_reduce(N, KOKKOS_LAMBDA (const int i, double& product) {
    product *= i;
  },
  [] (volatile double& value, const volatile double& update) { value *= update; },
  total_product);
```
|
I would prefer the second version. For the first version (and the currently-existing methods), the initial value of total_product is silently ignored. Users who are familiar with openmp expect this initial value to be used. For example, my students (to whom I'm just starting to talk about Kokkos right after OpenMP and TBB) were aghast to find that Kokkos ignored that value. I see no benefit to ignoring the value, and it's more error-prone (in my opinion). |
I agree with Jeff that the second one seems to be the more logical approach. It's likely that initialization of the value will happen earlier in the function anyway, since most programmers are attuned to that. |
Have you looked into how this may mesh with user defined reductions in OpenMP? If we can get an alignment this might give us some better level of optimization. The use of volatiles everywhere is likely to mean limited vectorization. |
Option for custom reduction not in the functor. |
I know I'm asking for the world on a stick what I'd /really/ love is .. my_variable = parallel_reduce(policy, functor, join_lambda, init_lambda) |
No, you want:

> I know I'm asking for the world on a stick what I'd /really/ love is .. `my_variable = parallel_reduce(policy, functor, join_lambda, init_lambda)`
|
Yes but I figured one step at a time. That's what I'd really like please. |
Oops.. clicked wrong button. Sorry. When this is in the C++XX standard, is this going to work with std::async? |
Another idea is to follow an Intel proposal with something like this:

```c++
Kokkos::parallel_reduce(N, [=](const int& i, double& thread_local_result)
  { /* ...BODY... */ },
  Kokkos::reduction(result_place, init_value, OperationType())
);
```

Effectively `reduction` would return a struct which knows where to put the final result (`result_place`, which could be a reference to a double, a `double*`, or a `Kokkos::View` of a double), how to initialize thread-local contributions, and what the combination operation is. |
Based upon recent telecon with NVIDIA, allowing lambda declarations within the "reduction" clause is extremely problematic. For portability the "reduction" clause needs to be a functor that defines the join, init, and result processing/placement. We can provide helpers that would cover the "90-99%" use cases (assuming += is the 90% use case) of min/max. |
Can we have two variants? |
For NVIDIA portability we can't do: `parallel_reduce( policy, work_closure, join_lambda, init_lambda )`
The interface for a reduction specialization is to move the current specialization code into a separate class.
Then provide helpers for the common reduction operations
The
|
After reviewing the current implementation we decided to make this part of an overhaul of the internals of parallel_reduce, whose complexity has gotten out of hand with its many overloads. Therefore this is getting postponed. |
I don't believe that the POWER8 (v2.07 ISA) has horizontal instructions but I'm surprised the XL compiler cannot vectorize this code without some nicety around a shuffle to perform the final collapse. Assuming no downside for performance, it would be better to have simple reductions cast as above so where there is compiler assistance for handling reductions (as in Intel and I think newer GCC) we gain the benefit. |
By the way, my AVX512 test is a near identical code I wrote but reads |
The AVX512 code with Intel 17 vectorizes the Kokkos variant as well. |
GCC 4.7.4 will compile and vectorize this provided
Produces compiler vectorization output of:
|
I think we will need to talk to IBM about why XL does not want to vectorize the code example and see what the reason is. It may just be a cost-model issue in the end, although that's clearly not the feedback the optimization report provides. |
So we are agreed that simple reductions should utilize |
OK, well let's put it this way: are we happy to bypass determinism, or would we otherwise resort to a local collapse and then a deterministic tree reduce between threads? |
I don't think that we agreed on that yet :-). So far everything is using a deterministic tree reduce. |
We have deterministic reductions on all statically scheduled execution policies on all backends by scheduling ranges to threads and performing a deterministic inter-thread reduce. I don't see a need to add complexity of dispatching to the built in OpenMP reduce for trivial types and our own for non-trivial types unless it is known to perform better than our own reduce. |
Also I'd like to see some performance analysis where the benefit is shown to switching to that. |
OK, I know what this issue is with IBM XL. The problem is that the summation variable is of type
Compiler report for the code sequence you pasted in the comments:
|
You cannot use |
I just ran this on KNL. The raw OpenMP as posted above is about 5% slower (consistently) than the Kokkos reduction, using Intel 17. I.e.:

```c++
double value = 0;
Kokkos::Impl::Timer timer;
Kokkos::parallel_reduce(N, KOKKOS_LAMBDA (const int& i, double& lsum) {
    lsum += i;
  }, value);
double time1 = timer.seconds();

value = 0;
timer.reset();
#pragma omp parallel for reduction(+:value)
for(int i = 0; i < N; i++)
  value += i;
double time2 = timer.seconds();
```

I used N = 1000000000 and the times were 5.2e-3 vs 5.5e-3; the variance in runtime was smaller than the difference of the two times (with the Kokkos variant having less variance). This is with 256 threads and KMP_AFFINITY=compact. |
Sorry I can't parse your sentence, the raw OpenMP is 5% slower than Kokkos? |
Yes, I will collect some histogram now. |
Can you try setting |
Just to add noise to this thread, a few months ago we were using OpenMP reductions in Uintah and we saw about a 5% improvement when we switched to using Kokkos reductions with dynamic scheduling (static scheduling was slightly slower than OpenMP). --Dan
|
OK I will send some data to you via email for KNL (or is the NDA lifted now?). |
So using the OpenMP native reduction would complicate the code and in special circumstances yield a negligible performance gain. Until there is evidence to the contrary, the native OpenMP reduction option is mothballed. |
NDA is still in place until further notice. |
@dsunder @nmhamster @crtrott Dan might count as a Sandian. I'll e-mail you with details. |
Only to my Sandia email please :) --Dan
|
Btw. I just ran into another ugly issue. The return argument must be taken by reference if it is a scalar, but must be taken by value if it is a View or Reducer type in order to support inline construction:

```c++
parallel_reduce(N, lambda, Kokkos::View<double*, Kokkos::MemoryUnmanaged>(data, K));
```

where `data` is some pointer to place the result values in.
|
So are we about flat for compile time, or is the 6% worse than the original, or is this an increase of 6% such that we are something <6% worse than the original? |
We are now worse than the original, with the main extra cost coming from the support of inline construction for the reducer or a return view. The problem is that it forces me to add more template enable_ifs to the entry functions to distinguish between arguments which can and must be taken by reference and arguments which can (and must) be taken by value. I.e. all the entry functions now need:

```c++
std::enable_if< !is_view<ReturnArgument>::value && !is_reducer<ReturnArgument>::value >::type* = 0
```

or

```c++
std::enable_if< is_view<ReturnArgument>::value || is_reducer<ReturnArgument>::value >::type* = 0
```

This in particular costs adding enable_if logic to functions which didn't need it before, and I believe as a consequence it takes much longer to figure out which overload to take. |
I should add: just supporting reducers got us back where we started. I.e. I first improved compile times by about 2%, then I lost about 3% again to support reducers, and then I lost another 4-5% to support inline construction. |
This is now in develop. I do not have Kokkos provided reducers, but you can write your own. |
E.g., `parallel_reduce( Policy, base_lambda, join_lambda, return_value )`