Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kokkos::sort defaults to std::sort on host #1208

Closed
mhoemmen opened this issue Nov 2, 2017 · 24 comments
Closed

Kokkos::sort defaults to std::sort on host #1208

mhoemmen opened this issue Nov 2, 2017 · 24 comments
Assignees
Labels
Enhancement Improve existing capability; will potentially require voting
Milestone

Comments

@mhoemmen
Copy link
Contributor

mhoemmen commented Nov 2, 2017

Thus, no OpenMP parallelism. Options to fix:

  1. Use C++17 parallel algorithm extensions, if available
  2. Implement an OpenMP parallel sort, e.g., https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-for-tbb-cilk-plus-and-openmp
  3. ???
@mhoemmen
Copy link
Contributor Author

mhoemmen commented Nov 2, 2017

@vbrunini

@nmhamster
Copy link
Contributor

Could use GNU parallel sort. It's very fast and hits their OpenMP runtime. Maybe specialise by compiler?

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Nov 2, 2017

@nmhamster How new of a GCC does that require?

@etphipp
Copy link
Contributor

etphipp commented Nov 2, 2017

In other projects, I’ve had good success with the Intel PSS Mark listed.

@nmhamster
Copy link
Contributor

@mhoemmen - I think its fairly old. Our experience was that this was pretty good for performance. I'm fairly sure it was at least in GCC 4.7 and we are now moving beyond that right?

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Nov 2, 2017

@etphipp wrote:

In other projects, I’ve had good success with the Intel PSS Mark listed.

It has the usual modified BSD license, from what I can tell. Which version did you use? I'm looking at the OpenMP tasks version now. Would using raw OpenMP tasks break Kokkos? The code does not set the number of threads, etc..

@nmhamster wrote:

I'm fairly sure it was at least in GCC 4.7 and we are now moving beyond that right?

Minimum GCC version for us is 4.9.3. Thanks!

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Nov 2, 2017

OK, let me at this. I'll first plug in GCC's parallel sort if available. It uses OpenMP, so this should be conditional on the OpenMP execution space.

@ibaned
Copy link
Contributor

ibaned commented Nov 2, 2017

I've used the PSS code as well, and have a cleaned up version of it here:

https://github.com/ibaned/omega_h/tree/master/src/intel_sort

I use that for OpenMP, thrust::sort for CUDA, and std::sort for serial.

@ibaned
Copy link
Contributor

ibaned commented Nov 2, 2017

If you could do a performance comparison of GCC's sort and the PSS code, I'd be really interested in the results.

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Nov 2, 2017

@ibaned Thanks! :-D I plan to plug in the "Technical Specification for C++ Extensions for Parallelism" too / first. Do you have experience with those?

@ibaned
Copy link
Contributor

ibaned commented Nov 2, 2017

no... I assume thats closely related to the GCC parallel sort, but I've never tried calling those (needed something I was sure was on every compiler).

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Nov 2, 2017

The only issue with the C++ TS is that it does not promise whether it uses OpenMP or C++ threads, so perhaps I shouldn't plug that in until I learn more. GCC's __gcc_parallel::* algorithms promise to use OpenMP (and require it).

@etphipp
Copy link
Contributor

etphipp commented Nov 3, 2017

I'm using the same logic as Dan in my tensor code that needs sorting (Thrust for Cuda, PSS for OpenMP, ...):

https://gitlab.com/tensors/genten/blob/master/src/Genten_Sptensor_perm.cpp

I've generally found Thrust to be faster than Kokkos' sort. I have not tried to use GNU. I have the Intel PSS included in that library (I believe I actually got the code from Dan's library):

https://gitlab.com/tensors/genten/blob/master/src/parallel_stable_sort.hpp

And there was no issue including this in our library that was released under BSD.

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Nov 3, 2017

@etphipp thanks! I think what I would like to do, is plug in all the sorts I can get and try them out. The challenge is figuring out good benchmarks.

We'll also need a "sort array x and apply permutation to array y" function (what Tpetra calls "sort2") at least. Jon Clausen tried Kokkos::sort and found out that since it is a bin sort, it wants to do arithmetic on entries. It also lacks an interface for extracting an integer key from an entry. Thus, we'll need to modify it anyway.

@ibaned
Copy link
Contributor

ibaned commented Nov 3, 2017

Two general comments:

  • I think I made functional changes to PSS (cutoff numbers), so its entirely possible I degraded performance for some scenario, although not the ones I cared about at the time
  • Thrust should have an OpenMP backend for its sort as well, that may also be worth exploring.

In terms of bechmarks, one case that I think both my code and Tpetra care about is:

  • global indices from one partition of an overlapped map (CRS column map) of a typical partitioned finite element problem. these are all unique, but not necessarily contiguous. They can probably be easily generated using real or synthetic matrices. They should be 64 bit signed integers.
  • It might be interesting to contrast performance between this data set and a data set of the same size and type with completely random numbers.

Good data set sizes for both would be 10^3, 10^4, 10^5, 10^6. Albany tends to run about 10^5 finite elements per MPI rank, Alexa more like 10^6, while LAMMPS will sometimes run as few as 10^3 particles per MPI rank.

Extracting the permutation is important to me too, I reuse this permutation on a dozen different arrays after the sort.

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Nov 3, 2017

I'm currently blocked by this issue: #1212

I'll figure that one out :-)

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Nov 6, 2017

In terms of Trilinos development, I can work around #1212. In fact, I already did: trilinos/Trilinos#1946 (is closed). However, it would be nice to have #1212 fixed (it requires someone to approve the PR #1213).

@ibaned ibaned added the Enhancement Improve existing capability; will potentially require voting label Nov 6, 2017
@ibaned ibaned added this to the 2018 February milestone Nov 6, 2017
@mhoemmen
Copy link
Contributor Author

mhoemmen commented Nov 8, 2017

PR #1213 got approved. Thus, this issue is no longer blocked.

@mhoemmen
Copy link
Contributor Author

I have a patch ready for this, that calls __gnu_parallel::sort if appropriate and if using OpenMP. I need to test with a compiler that does not claim to be GCC (Clang also does; the patch works fine there!).

@mhoemmen
Copy link
Contributor Author

GCC, Clang, and Intel all provide __gnu_parallel::sort. It's actually fast with the Intel compiler! That's not crazy, because Intel on Linux at least links against the GNU headers. I'll need to find a compiler that does not define __GNUC__. I think even PGI defines that macro now and attempts GCC compatibility, so this might be quite a challenge. How about XL?

@ibaned ibaned self-assigned this Dec 13, 2017
@ibaned
Copy link
Contributor

ibaned commented Dec 13, 2017

This is represented by pull request #1226, which we'll look at during the next promotion cycle (February)

@mhoemmen
Copy link
Contributor Author

@ibaned I haven't had a chance to test on all supported platforms yet, so I'm glad y'all are waiting :-) .

crtrott added a commit that referenced this issue Feb 5, 2018
Fix #1208 by using __gnu_parallel::sort if available
@ibaned ibaned removed their assignment Feb 5, 2018
@ibaned
Copy link
Contributor

ibaned commented Feb 5, 2018

Changing assignee to @crtrott since he ended up doing all the testing

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Feb 5, 2018

Thanks @ibaned !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Improve existing capability; will potentially require voting
Projects
None yet
Development

No branches or pull requests

6 participants