Kokkos::sort defaults to std::sort on host #1208

mhoemmen · 2017-11-02T17:28:36Z

Thus, no OpenMP parallelism. Options to fix:

Use C++17 parallel algorithm extensions, if available
Implement an OpenMP parallel sort, e.g., https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-for-tbb-cilk-plus-and-openmp
???

mhoemmen · 2017-11-02T17:29:18Z

@vbrunini

nmhamster · 2017-11-02T18:27:35Z

Could use GNU parallel sort. It's very fast and hits their OpenMP runtime. Maybe specialise by compiler?

mhoemmen · 2017-11-02T19:05:26Z

@nmhamster How new of a GCC does that require?

etphipp · 2017-11-02T19:46:02Z

In other projects, I’ve had good success with the Intel PSS Mark listed.

nmhamster · 2017-11-02T19:47:17Z

@mhoemmen - I think its fairly old. Our experience was that this was pretty good for performance. I'm fairly sure it was at least in GCC 4.7 and we are now moving beyond that right?

mhoemmen · 2017-11-02T19:59:09Z

@etphipp wrote:

In other projects, I’ve had good success with the Intel PSS Mark listed.

It has the usual modified BSD license, from what I can tell. Which version did you use? I'm looking at the OpenMP tasks version now. Would using raw OpenMP tasks break Kokkos? The code does not set the number of threads, etc..

@nmhamster wrote:

I'm fairly sure it was at least in GCC 4.7 and we are now moving beyond that right?

Minimum GCC version for us is 4.9.3. Thanks!

mhoemmen · 2017-11-02T20:09:16Z

OK, let me at this. I'll first plug in GCC's parallel sort if available. It uses OpenMP, so this should be conditional on the OpenMP execution space.

ibaned · 2017-11-02T20:34:07Z

I've used the PSS code as well, and have a cleaned up version of it here:

https://github.com/ibaned/omega_h/tree/master/src/intel_sort

I use that for OpenMP, thrust::sort for CUDA, and std::sort for serial.

ibaned · 2017-11-02T20:34:59Z

If you could do a performance comparison of GCC's sort and the PSS code, I'd be really interested in the results.

mhoemmen · 2017-11-02T21:37:20Z

@ibaned Thanks! :-D I plan to plug in the "Technical Specification for C++ Extensions for Parallelism" too / first. Do you have experience with those?

ibaned · 2017-11-02T21:38:48Z

no... I assume thats closely related to the GCC parallel sort, but I've never tried calling those (needed something I was sure was on every compiler).

mhoemmen · 2017-11-02T22:02:31Z

The only issue with the C++ TS is that it does not promise whether it uses OpenMP or C++ threads, so perhaps I shouldn't plug that in until I learn more. GCC's __gcc_parallel::* algorithms promise to use OpenMP (and require it).

etphipp · 2017-11-03T14:58:04Z

I'm using the same logic as Dan in my tensor code that needs sorting (Thrust for Cuda, PSS for OpenMP, ...):

https://gitlab.com/tensors/genten/blob/master/src/Genten_Sptensor_perm.cpp

I've generally found Thrust to be faster than Kokkos' sort. I have not tried to use GNU. I have the Intel PSS included in that library (I believe I actually got the code from Dan's library):

https://gitlab.com/tensors/genten/blob/master/src/parallel_stable_sort.hpp

And there was no issue including this in our library that was released under BSD.

mhoemmen · 2017-11-03T15:31:54Z

@etphipp thanks! I think what I would like to do, is plug in all the sorts I can get and try them out. The challenge is figuring out good benchmarks.

We'll also need a "sort array x and apply permutation to array y" function (what Tpetra calls "sort2") at least. Jon Clausen tried Kokkos::sort and found out that since it is a bin sort, it wants to do arithmetic on entries. It also lacks an interface for extracting an integer key from an entry. Thus, we'll need to modify it anyway.

ibaned · 2017-11-03T15:44:12Z

Two general comments:

I think I made functional changes to PSS (cutoff numbers), so its entirely possible I degraded performance for some scenario, although not the ones I cared about at the time
Thrust should have an OpenMP backend for its sort as well, that may also be worth exploring.

In terms of bechmarks, one case that I think both my code and Tpetra care about is:

global indices from one partition of an overlapped map (CRS column map) of a typical partitioned finite element problem. these are all unique, but not necessarily contiguous. They can probably be easily generated using real or synthetic matrices. They should be 64 bit signed integers.
It might be interesting to contrast performance between this data set and a data set of the same size and type with completely random numbers.

Good data set sizes for both would be 10^3, 10^4, 10^5, 10^6. Albany tends to run about 10^5 finite elements per MPI rank, Alexa more like 10^6, while LAMMPS will sometimes run as few as 10^3 particles per MPI rank.

Extracting the permutation is important to me too, I reuse this permutation on a dozen different arrays after the sort.

mhoemmen · 2017-11-03T18:42:34Z

I'm currently blocked by this issue: #1212

I'll figure that one out :-)

mhoemmen · 2017-11-06T21:41:07Z

In terms of Trilinos development, I can work around #1212. In fact, I already did: trilinos/Trilinos#1946 (is closed). However, it would be nice to have #1212 fixed (it requires someone to approve the PR #1213).

mhoemmen · 2017-11-08T00:06:06Z

PR #1213 got approved. Thus, this issue is no longer blocked.

mhoemmen · 2017-11-10T05:33:09Z

I have a patch ready for this, that calls __gnu_parallel::sort if appropriate and if using OpenMP. I need to test with a compiler that does not claim to be GCC (Clang also does; the patch works fine there!).

mhoemmen · 2017-11-10T17:14:03Z

GCC, Clang, and Intel all provide __gnu_parallel::sort. It's actually fast with the Intel compiler! That's not crazy, because Intel on Linux at least links against the GNU headers. I'll need to find a compiler that does not define __GNUC__. I think even PGI defines that macro now and attempts GCC compatibility, so this might be quite a challenge. How about XL?

ibaned · 2017-12-13T20:27:09Z

This is represented by pull request #1226, which we'll look at during the next promotion cycle (February)

mhoemmen · 2017-12-14T04:08:08Z

@ibaned I haven't had a chance to test on all supported platforms yet, so I'm glad y'all are waiting :-) .

Fix #1208 by using __gnu_parallel::sort if available

ibaned · 2018-02-05T14:39:43Z

Changing assignee to @crtrott since he ended up doing all the testing

mhoemmen · 2018-02-05T15:53:07Z

Thanks @ibaned !

Revert "Fix #1208 by using __gnu_parallel::sort if available"

ibaned added the Enhancement Improve existing capability; will potentially require voting label Nov 6, 2017

ibaned added this to the 2018 February milestone Nov 6, 2017

ibaned self-assigned this Dec 13, 2017

hcedwar mentioned this issue Aug 11, 2017

Can Kokkos includes detect mismatched versions? #1031

Closed

crtrott added a commit that referenced this issue Feb 5, 2018

Merge pull request #1389 from mhoemmen/Fix-1208-a

7e62647

Fix #1208 by using __gnu_parallel::sort if available

crtrott added the InDevelop label Feb 5, 2018

crtrott assigned mhoemmen Feb 5, 2018

ibaned removed their assignment Feb 5, 2018

ibaned assigned crtrott Feb 5, 2018

ndellingwood mentioned this issue Feb 5, 2018

Kokkos_Sort error with Trilinos: 'float std::tr1::fabs(float)' conflicts with a previous declaration #1391

Closed

crtrott added a commit that referenced this issue Feb 8, 2018

Revert "Fix #1208 by using __gnu_parallel::sort if available"

04c8e15

crtrott added a commit that referenced this issue Feb 9, 2018

Merge pull request #1401 from kokkos/revert-1389-Fix-1208-a

bf7831f

Revert "Fix #1208 by using __gnu_parallel::sort if available"

ndellingwood closed this as completed Mar 7, 2018

ndellingwood mentioned this issue Mar 8, 2018

Kokkos + KokkosKernels Promotion To Version 2.6.00 trilinos/Trilinos#2351

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kokkos::sort defaults to std::sort on host #1208

Kokkos::sort defaults to std::sort on host #1208

mhoemmen commented Nov 2, 2017

mhoemmen commented Nov 2, 2017

nmhamster commented Nov 2, 2017

mhoemmen commented Nov 2, 2017

etphipp commented Nov 2, 2017

nmhamster commented Nov 2, 2017

mhoemmen commented Nov 2, 2017 •

edited

mhoemmen commented Nov 2, 2017

ibaned commented Nov 2, 2017

ibaned commented Nov 2, 2017

mhoemmen commented Nov 2, 2017

ibaned commented Nov 2, 2017

mhoemmen commented Nov 2, 2017 •

edited

etphipp commented Nov 3, 2017

mhoemmen commented Nov 3, 2017 •

edited

ibaned commented Nov 3, 2017

mhoemmen commented Nov 3, 2017

mhoemmen commented Nov 6, 2017

mhoemmen commented Nov 8, 2017

mhoemmen commented Nov 10, 2017

mhoemmen commented Nov 10, 2017

ibaned commented Dec 13, 2017

mhoemmen commented Dec 14, 2017

ibaned commented Feb 5, 2018

mhoemmen commented Feb 5, 2018

Kokkos::sort defaults to std::sort on host #1208

Kokkos::sort defaults to std::sort on host #1208

Comments

mhoemmen commented Nov 2, 2017

mhoemmen commented Nov 2, 2017

nmhamster commented Nov 2, 2017

mhoemmen commented Nov 2, 2017

etphipp commented Nov 2, 2017

nmhamster commented Nov 2, 2017

mhoemmen commented Nov 2, 2017 • edited

mhoemmen commented Nov 2, 2017

ibaned commented Nov 2, 2017

ibaned commented Nov 2, 2017

mhoemmen commented Nov 2, 2017

ibaned commented Nov 2, 2017

mhoemmen commented Nov 2, 2017 • edited

etphipp commented Nov 3, 2017

mhoemmen commented Nov 3, 2017 • edited

ibaned commented Nov 3, 2017

mhoemmen commented Nov 3, 2017

mhoemmen commented Nov 6, 2017

mhoemmen commented Nov 8, 2017

mhoemmen commented Nov 10, 2017

mhoemmen commented Nov 10, 2017

ibaned commented Dec 13, 2017

mhoemmen commented Dec 14, 2017

ibaned commented Feb 5, 2018

mhoemmen commented Feb 5, 2018

mhoemmen commented Nov 2, 2017 •

edited

mhoemmen commented Nov 2, 2017 •

edited

mhoemmen commented Nov 3, 2017 •

edited