Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Implement radix sort #12586

Merged
merged 2 commits into from
May 12, 2019
Merged

ENH: Implement radix sort #12586

merged 2 commits into from
May 12, 2019

Conversation

hameerabbasi
Copy link
Contributor

@hameerabbasi hameerabbasi commented Dec 18, 2018

Fixes #6050

This PR implements radix/counting sort for integers.

  • Implement radix sort.
  • Implement radix argsort.
  • Actually use radix sort/argsort.
  • Tests
  • Benchmarks
  • Add optimisations
  • Make radix sort default for integers/bools
  • Docs

cc: @mattip A review would be nice, along with next steps. Feel free to force-push if need be.

@mattip
Copy link
Member

mattip commented Dec 18, 2018

tests and benchmarks come before making radix the default for integers.

@hameerabbasi
Copy link
Contributor Author

Looks like I need to implement a generic npy_radixsort and npy_aradixsort going by the quicksort.c.src file... Can I have an example of how I'd dispatch based on dtype? I assume I have to do something with PyArray_DESCR.

@andravin
Copy link

Will you also support radix sort for floats? One can make this work by deriving keys from the floats, as documented here: https://github.com/eloj/radix-sorting#float-keys

@hameerabbasi
Copy link
Contributor Author

Will you also support radix sort for floats? One can make this work by deriving keys from the floats, as documented here: https://github.com/eloj/radix-sorting#float-keys

I can do float32/float64 easily but I'm not sure if there are integer equivalents to complex*/float80/float128. Those might require multiple passes and/or copies.

@mattip
Copy link
Member

mattip commented Dec 21, 2018

an example of how I'd dispatch based on dtype

You would have to check the dtypes in PyArray_ArgSort, PyArray_ArgPartition, PyArray_Sort, and PyArray_Partition. You can use the PyArray_ISINTEGER macro.

@hameerabbasi
Copy link
Contributor Author

You can use the PyArray_ISINTEGER macro.

I still need to check for bool and float/double. Also, I need to decide the exact method based on the exact dtype.

@mattip
Copy link
Member

mattip commented Dec 24, 2018

You attach the sort methods to the dtype, so this line will return NULL if that dtype cannot do radix sort

@hameerabbasi
Copy link
Contributor Author

Next, I think, comes attaching kind='radixsort' to the sort and argsort functions, writing tests, benchmarks and docs.

It'd also be nice to know where the default sort is decided, if the benchmarks show an improvement and we want to change it.

@mattip
Copy link
Member

mattip commented Dec 24, 2018

where the default sort is decided

Grep for PyArray_SortkindConverter and NPY_QUICKSORT. The first is used to convert a string to the proper enum. I think there are some other places that used NY_QUICKSORT directly.

Copy link

@andravin andravin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might need to zero-initialize cnt array before use.

numpy/core/src/npysort/radixsort.c.src Outdated Show resolved Hide resolved
numpy/core/src/npysort/radixsort.c.src Show resolved Hide resolved
@hameerabbasi
Copy link
Contributor Author

Okay, while this works, it's massively slower than quicksort (by about a factor of 3). I went back to the reference repo: https://github.com/eloj/radix-sorting

And it seems he's doing two things different:

  1. Multiple passes for 64-bit and more, along with a bit of OOP to decide the key function. Can be simulated by function pointer, but I highly doubt it's any faster/slower than what we're doing.
  2. In his benchmark, he excludes the cost of allocating the auxiliary array.

@mattip
Copy link
Member

mattip commented Dec 25, 2018

allocating the auxiliary array

Can you do this once for each int dtype and store it somewhere?

@hameerabbasi
Copy link
Contributor Author

hameerabbasi commented Dec 25, 2018

Can you do this once for each int dtype and store it somewhere?

It's the exact same size as the input array, so unfortunately not.

EDIT: Here's the original implementation. Everything besides the auxiliary array is on the stack, so we avoid allocations.

@andravin
Copy link

andravin commented Dec 25, 2018

Can you do this once for each int dtype and store it somewhere?

It's the exact same size as the input array, so unfortunately not.

EDIT: Here's the original implementation. Everything besides the auxiliary array is on the stack, so we avoid allocations.

Usually we should avoid using a malloc in a numeric function. Often it is better to change the API, as the original implementation you reference does, so that any workspace or output buffer is passed as an argument. The user manages the memory and potentially re-uses buffers instead of allocating them repeatedly.

The next level of abstraction, which hides memory management from the user but still affords reuse, is to wrap such an allocation-free inner function with one that uses a memory pool to allocate workspace/output buffers. This makes memory allocation very fast for frequently used buffer sizes.

Unfortunately, I am not familiar enough with numpy internals to understand if these design patterns are already in use elsewhere.

Also, this is a nice C/C++ idiom for zero-initialization of a heap allocated array:

size_t cnt0[256] = { 0 };

It might generate the same code as memset, but it is much cleaner.

[edited for clarity]

@andravin
Copy link

andravin commented Dec 25, 2018

Would be nice to have multi-core support so that all available memory bandwidth can be utilized.

Would be nice to try 11-bit (2048 entry) counters in radix-sort instead of 8-bit (256 entry). These require 3 passes for 32-bit keys instead of 4, but still fit in L1 cache.

Would be nice to have integer and float merge-sort that use SIMD instructions and compare with radix-sort.

This is a lot of stuff, I am just putting it is on the radar. Cheers.

@hameerabbasi
Copy link
Contributor Author

hameerabbasi commented Dec 26, 2018

The next level of abstraction, which hides memory management from the user but still affords reuse, is to wrap such an allocation-free inner function with one that uses a memory pool to allocate workspace/output buffers. This makes memory allocation very fast for frequently used buffer sizes.

From what I understand the pymalloc allocator already does this. @mattip, do we have the PyObjectArenaAllocator held somewhere or would I need to use the getter function?

EDIT: Transposing two loops and getting rid of some unnecessary arrays leads to a huge speedup: 2x over quicksort on a 10 ** 6-sized array.

I'll also add in some other optimisations from this implementation, which is the one the author uses in the benchmark.

@hameerabbasi
Copy link
Contributor Author

Usually we should avoid using a malloc in a numeric function. Often it is better to change the API, as the original implementation you reference does, so that any workspace or output buffer is passed as an argument. The user manages the memory and potentially re-uses buffers instead of allocating them repeatedly.

I'm not sure how much effort this would take, but the user doesn't usually manage buffers in NumPy, the exception being out kwargs, which are rarely used. I agree with the mixing sentiment, I was hoping to avoid this myself, but the sorting code is written in a way that would disallow this. The NOT_USED pointer is a PyArray... which means it can't be used for passing in the auxiliary array. I would need to change all sorts to change this.

Would be nice to have multi-core support so that all available memory bandwidth can be utilized.

In general, NumPy doesn't use multi-core anywhere in its codebase.

Would be nice to try 11-bit (2048 entry) counters in radix-sort instead of 8-bit (256 entry). These require 3 passes for 32-bit keys instead of 4, but still fit in L1 cache.

Apparently, the author of the original repo tried this and found it wasn't all it was hyped up to be. I'm assuming it has to do with byte alignment being more efficient.

Would be nice to have integer and float merge-sort that use SIMD instructions and compare with radix-sort.

This would either require different compiler flags or ASM code, which has no precedent in NumPy.

Also, this is a nice C/C++ idiom for zero-initialization of a heap allocated array:

size_t cnt0[256] = { 0 };

Done!

@jakirkham
Copy link
Contributor

@jschueller probably can give you some tips on how to structure the C code to get the compiler to make use of SIMD instructions.

@hameerabbasi
Copy link
Contributor Author

hameerabbasi commented Dec 26, 2018

I added more optimisations:

  1. It sorts by just the columns needed instead of all of them.
  2. It terminates early on already sorted.
  3. Bit twiddling instead of arithmetic should make operations faster.

The problem is somehow these don't work currently. :/

Edit: They do work. 😄 Apparently, Jupyter Lab needs a python setup.py build_ext -i before it imports the most recent version, whereas runtests.py has its own version.

Edit 2: CI is broken because of warnings, tests pass.

@andravin
Copy link

@jschueller probably can give you some tips on how to structure the C code to get the compiler to make use of SIMD instructions.

After some digging I found https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/simd.inc.src

/*
 * This file is for the definitions of simd vectorized operations.
 *
 * Currently contains sse2 functions that are built on amd64, x32 or
 * non-generic builds (CFLAGS=-march=...)
 * In future it may contain other instruction sets like AVX or NEON detected
 * at runtime in which case it needs to be included indirectly via a file
 * compiled with special options (or use gcc target attributes) so the binary
 * stays portable.
 */

It uses the Intel intrinsics, which are basically a C API for SIMD instructions:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Here is a paper that describes how to sort numbers using the 128-bit SSE intrinsics:
http://www.vldb.org/pvldb/1/1454171.pdf

So SIMD accelerated merge-sort appears possible by adding the vector sort primitives documented in the paper to umath/simd.inc.src.

@andravin
Copy link

andravin commented Dec 26, 2018

Would be nice to have multi-core support so that all available memory bandwidth can be utilized.

In general, NumPy doesn't use multi-core anywhere in its codebase.

I see. But NumPy already depends on multi-threaded BLAS libraries. I actually don't see why NumPy would not also enable multi-threading elsewhere using the same OpenMP style environment variables that BLAS libraries use for threading control. Default to 1 thread for backwards compatibility (otherwise users who have already built multi-processing into their applications will be impacted).

Edit: #11826 seeks a numpy API to control the number of BLAS threads, so that applications that are tuned for multi-processing can force it to 1 (the trick is there is no standard BLAS API for setting the number of threads, and there is no way to know which BLAS library that numpy is using). Of course this same API could also control the number of threads available to any numpy internal functions. I have yet to see that the lack of threading in numpy internals is a policy, rather historically most of the heavy lifting is deferred to BLAS libraries which are themselves multi-threaded.

@hameerabbasi hameerabbasi changed the title [WIP] Implement radix sort ENH: Implement radix sort Dec 27, 2018
@hameerabbasi
Copy link
Contributor Author

cc @mattip I think this is ready, a review would be awesome. I have to update docs and changelog, but otherwise this is ready.

@hameerabbasi
Copy link
Contributor Author

The test failures don't seem to be related to this PR, as a best guess.

@hameerabbasi
Copy link
Contributor Author

Benchmarks above. :-) It'd be nice to have it as an option, I agree. But the issue is, it's impossible to add a sort without breaking the ABI, currently this PR replaces mergesort (which is now timsort) for relevant dtypes.

@hameerabbasi
Copy link
Contributor Author

(I still have the branch around, just doing some housekeeping on PRs, I'd be willing to work on this if NumPy decided their ABI was okay to break)

@rgommers
Copy link
Member

rgommers commented Apr 8, 2019

I'd be willing to work on this if NumPy decided their ABI was okay to break)

not for this I'd say, however you can turn it around too - when we eventually break the ABI then that's a good moment to add radixsort

@ahaldane
Copy link
Member

ahaldane commented Apr 8, 2019

@charris
Copy link
Member

charris commented Apr 8, 2019

A test that seems to be missing from the benchmarks is a simple random array. I noted that when I put them in, so maybe we should add one. The current tests are for arrays that are already ordered to some extent, which is what timsort is designed to deal with.

@hameerabbasi
Copy link
Contributor Author

A test that seems to be missing from the benchmarks is a simple random array. I noted that when I put them in, so maybe we should add one. The current tests are for arrays that are already ordered to some extent, which is what timsort is designed to deal with.

Opened a PR for this, #13287.

@hameerabbasi
Copy link
Contributor Author

Here are the benchmarks for randomly-shuffled arrays:

This branch:

               merge   int16            ('random',)              157±0.6μs
               merge   int32            ('random',)               168±2μs
               merge   int64            ('random',)               200±4μs
               merge   int16            ('random',)              51.7±0.2μs
               merge   int32            ('random',)               99.2±6μs
               merge   int64            ('random',)              202±0.5μs
               merge   int16            ('random',)              10.5±0.1μs
               merge   int32            ('random',)              10.6±0.3μs
               merge   int64            ('random',)              8.96±0.4μs

master

               merge   int16            ('random',)               900±5μs
               merge   int32            ('random',)               893±20μs
               merge   int64            ('random',)               903±5μs
               merge   int16            ('random',)               507±9μs
               merge   int32            ('random',)               502±4μs
               merge   int64            ('random',)               522±6μs
               merge   int16            ('random',)              11.1±0.1μs
               merge   int32            ('random',)              9.74±0.1μs
               merge   int64            ('random',)             10.8±0.04μs

Which proves that radix sort is faster for truly random data (reversed is its worst case).

Now, there are two ways to look at this:

  1. Statistically, (assuming a randomly shuffled array), most arrays have little to no sorted area. (favors this PR)
  2. Practically, it could be that arrays aren't really randomly shuffled, and instead have sorted areas. (favors timsort)

Which is true? There is absolutely no way to know without some usage data. My instinct relies on the latter, as I think stable sorts are usually used to sort in succession (or preserve some kind of partial order), and so do have sorted areas.

In any case, this should be discussed.

@hameerabbasi hameerabbasi reopened this Apr 9, 2019
@hameerabbasi
Copy link
Contributor Author

Here's the full set of benchmarks:

This branch:

python runtests.py --bench sort
Building, see build.log...
Build OK
· No executable found for python 3.6
· Discovering benchmarks
· Running 4 total benchmarks (1 commits * 1 environments * 4 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_hameerabbasi_anaconda_envs_numpy_bin_python
[ 12.50%] ··· Running (bench_function_base.Sort.time_argsort--)...
[ 50.00%] ··· Running (bench_function_base.Sort.time_sort_worst--).
[ 62.50%] ··· bench_function_base.Sort.time_argsort                           ok
[ 62.50%] ··· ======= ======= ================================ =============
                kind   dtype             array_type
              ------- ------- -------------------------------- -------------
               merge   int16            ('random',)               152±2μs
               merge   int16            ('ordered',)             14.2±0.3μs
               merge   int16           ('reversed',)             138±0.4μs
               merge   int16            ('uniform',)            13.9±0.08μs
               merge   int16        ('sorted_block', 10)          183±2μs
               merge   int16       ('sorted_block', 100)          174±2μs
               merge   int16       ('sorted_block', 1000)         135±1μs
               merge   int16       ('swapped_pair', 0.01)         142±3μs
               merge   int16       ('swapped_pair', 0.1)         147±0.2μs
               merge   int16       ('swapped_pair', 0.5)          153±2μs
               merge   int16   ('random_unsorted_area', 0.5)      137±1μs
               merge   int16   ('random_unsorted_area', 0.1)      144±2μs
               merge   int16   ('random_unsorted_area', 0.01)     148±1μs
               merge   int16        ('random_bubble', 1)          146±4μs
               merge   int16        ('random_bubble', 5)         140±0.4μs
               merge   int16       ('random_bubble', 10)          144±3μs
               merge   int32            ('random',)               164±1μs
               merge   int32            ('ordered',)            16.2±0.08μs
               merge   int32           ('reversed',)              155±5μs
               merge   int32            ('uniform',)             15.9±0.1μs
               merge   int32        ('sorted_block', 10)          211±9μs
               merge   int32       ('sorted_block', 100)          187±5μs
               merge   int32       ('sorted_block', 1000)         148±2μs
               merge   int32       ('swapped_pair', 0.01)        148±0.8μs
               merge   int32       ('swapped_pair', 0.1)          162±3μs
               merge   int32       ('swapped_pair', 0.5)          164±1μs
               merge   int32   ('random_unsorted_area', 0.5)      152±4μs
               merge   int32   ('random_unsorted_area', 0.1)      147±2μs
               merge   int32   ('random_unsorted_area', 0.01)     158±4μs
               merge   int32        ('random_bubble', 1)         155±0.8μs
               merge   int32        ('random_bubble', 5)          153±2μs
               merge   int32       ('random_bubble', 10)          149±3μs
               merge   int64            ('random',)               210±8μs
               merge   int64            ('ordered',)             16.2±0.2μs
               merge   int64           ('reversed',)              184±2μs
               merge   int64            ('uniform',)             16.7±0.2μs
               merge   int64        ('sorted_block', 10)          236±6μs
               merge   int64       ('sorted_block', 100)          217±1μs
               merge   int64       ('sorted_block', 1000)         180±1μs
               merge   int64       ('swapped_pair', 0.01)         195±7μs
               merge   int64       ('swapped_pair', 0.1)          194±4μs
               merge   int64       ('swapped_pair', 0.5)          191±7μs
               merge   int64   ('random_unsorted_area', 0.5)     186±0.6μs
               merge   int64   ('random_unsorted_area', 0.1)      190±5μs
               merge   int64   ('random_unsorted_area', 0.01)    194±0.6μs
               merge   int64        ('random_bubble', 1)          196±4μs
               merge   int64        ('random_bubble', 5)          193±3μs
               merge   int64       ('random_bubble', 10)          196±3μs
              ======= ======= ================================ =============

[ 75.00%] ··· bench_function_base.Sort.time_sort                              ok
[ 75.00%] ··· ======= ======= ================================ =============
                kind   dtype             array_type
              ------- ------- -------------------------------- -------------
               merge   int16            ('random',)               52.0±1μs
               merge   int16            ('ordered',)             14.0±0.3μs
               merge   int16           ('reversed',)             53.9±0.3μs
               merge   int16            ('uniform',)             14.4±0.1μs
               merge   int16        ('sorted_block', 10)          50.7±1μs
               merge   int16       ('sorted_block', 100)         52.2±0.9μs
               merge   int16       ('sorted_block', 1000)        54.5±0.8μs
               merge   int16       ('swapped_pair', 0.01)        54.6±0.2μs
               merge   int16       ('swapped_pair', 0.1)         53.4±0.5μs
               merge   int16       ('swapped_pair', 0.5)         51.1±0.5μs
               merge   int16   ('random_unsorted_area', 0.5)      52.5±1μs
               merge   int16   ('random_unsorted_area', 0.1)     57.0±0.7μs
               merge   int16   ('random_unsorted_area', 0.01)    63.0±0.7μs
               merge   int16        ('random_bubble', 1)          63.1±2μs
               merge   int16        ('random_bubble', 5)         57.7±0.7μs
               merge   int16       ('random_bubble', 10)          55.9±1μs
               merge   int32            ('random',)               94.3±5μs
               merge   int32            ('ordered',)             15.2±0.1μs
               merge   int32           ('reversed',)              123±10μs
               merge   int32            ('uniform',)             14.9±0.3μs
               merge   int32        ('sorted_block', 10)          131±2μs
               merge   int32       ('sorted_block', 100)          129±3μs
               merge   int32       ('sorted_block', 1000)         120±10μs
               merge   int32       ('swapped_pair', 0.01)        108±0.4μs
               merge   int32       ('swapped_pair', 0.1)          119±8μs
               merge   int32       ('swapped_pair', 0.5)         100±0.3μs
               merge   int32   ('random_unsorted_area', 0.5)      122±10μs
               merge   int32   ('random_unsorted_area', 0.1)      114±2μs
               merge   int32   ('random_unsorted_area', 0.01)     120±2μs
               merge   int32        ('random_bubble', 1)          118±2μs
               merge   int32        ('random_bubble', 5)          113±1μs
               merge   int32       ('random_bubble', 10)          113±2μs
               merge   int64            ('random',)               199±4μs
               merge   int64            ('ordered',)            14.4±0.06μs
               merge   int64           ('reversed',)              179±5μs
               merge   int64            ('uniform',)             15.0±0.2μs
               merge   int64        ('sorted_block', 10)          234±4μs
               merge   int64       ('sorted_block', 100)          221±9μs
               merge   int64       ('sorted_block', 1000)         177±3μs
               merge   int64       ('swapped_pair', 0.01)         179±2μs
               merge   int64       ('swapped_pair', 0.1)          189±3μs
               merge   int64       ('swapped_pair', 0.5)         194±0.6μs
               merge   int64   ('random_unsorted_area', 0.5)      182±2μs
               merge   int64   ('random_unsorted_area', 0.1)     177±0.9μs
               merge   int64   ('random_unsorted_area', 0.01)     181±2μs
               merge   int64        ('random_bubble', 1)          189±2μs
               merge   int64        ('random_bubble', 5)          185±6μs
               merge   int64       ('random_bubble', 10)          182±5μs
              ======= ======= ================================ =============

[ 87.50%] ··· bench_function_base.Sort.time_sort_inplace                      ok
[ 87.50%] ··· ======= ======= ================================ =============
                kind   dtype             array_type
              ------- ------- -------------------------------- -------------
               merge   int16            ('random',)             10.7±0.07μs
               merge   int16            ('ordered',)            10.4±0.07μs
               merge   int16           ('reversed',)             10.7±0.2μs
               merge   int16            ('uniform',)             10.3±0.1μs
               merge   int16        ('sorted_block', 10)        10.4±0.09μs
               merge   int16       ('sorted_block', 100)         10.5±0.3μs
               merge   int16       ('sorted_block', 1000)        10.2±0.1μs
               merge   int16       ('swapped_pair', 0.01)       10.4±0.07μs
               merge   int16       ('swapped_pair', 0.1)         10.6±0.2μs
               merge   int16       ('swapped_pair', 0.5)         10.6±0.2μs
               merge   int16   ('random_unsorted_area', 0.5)     10.7±0.4μs
               merge   int16   ('random_unsorted_area', 0.1)    10.4±0.03μs
               merge   int16   ('random_unsorted_area', 0.01)    10.7±0.1μs
               merge   int16        ('random_bubble', 1)         10.7±0.2μs
               merge   int16        ('random_bubble', 5)         10.3±0.2μs
               merge   int16       ('random_bubble', 10)         10.5±0.2μs
               merge   int32            ('random',)              10.7±0.2μs
               merge   int32            ('ordered',)            10.5±0.04μs
               merge   int32           ('reversed',)            10.5±0.07μs
               merge   int32            ('uniform',)             10.7±0.1μs
               merge   int32        ('sorted_block', 10)        10.4±0.03μs
               merge   int32       ('sorted_block', 100)         10.9±0.2μs
               merge   int32       ('sorted_block', 1000)       10.5±0.04μs
               merge   int32       ('swapped_pair', 0.01)        10.6±0.2μs
               merge   int32       ('swapped_pair', 0.1)         10.7±0.3μs
               merge   int32       ('swapped_pair', 0.5)         10.5±0.1μs
               merge   int32   ('random_unsorted_area', 0.5)     10.8±0.2μs
               merge   int32   ('random_unsorted_area', 0.1)     10.6±0.2μs
               merge   int32   ('random_unsorted_area', 0.01)    10.8±0.4μs
               merge   int32        ('random_bubble', 1)         10.6±0.1μs
               merge   int32        ('random_bubble', 5)         10.5±0.1μs
               merge   int32       ('random_bubble', 10)         10.7±0.1μs
               merge   int64            ('random',)              8.48±0.2μs
               merge   int64            ('ordered',)            8.30±0.05μs
               merge   int64           ('reversed',)             8.63±0.1μs
               merge   int64            ('uniform',)             8.36±0.2μs
               merge   int64        ('sorted_block', 10)        8.50±0.05μs
               merge   int64       ('sorted_block', 100)        8.48±0.09μs
               merge   int64       ('sorted_block', 1000)        8.64±0.2μs
               merge   int64       ('swapped_pair', 0.01)       8.44±0.08μs
               merge   int64       ('swapped_pair', 0.1)        8.46±0.02μs
               merge   int64       ('swapped_pair', 0.5)         9.47±0.1μs
               merge   int64   ('random_unsorted_area', 0.5)     8.51±0.2μs
               merge   int64   ('random_unsorted_area', 0.1)     8.94±0.5μs
               merge   int64   ('random_unsorted_area', 0.01)   8.50±0.05μs
               merge   int64        ('random_bubble', 1)         8.45±0.2μs
               merge   int64        ('random_bubble', 5)        8.32±0.05μs
               merge   int64       ('random_bubble', 10)        8.35±0.07μs
              ======= ======= ================================ =============

[100.00%] ··· bench_function_base.Sort.time_sort_worst                  96.1±2ms

master:

python runtests.py --bench sort
Building, see build.log...
Build OK
· No executable found for python 3.6
· Discovering benchmarks
· Running 4 total benchmarks (1 commits * 1 environments * 4 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_hameerabbasi_anaconda_envs_numpy_bin_python
[ 12.50%] ··· Running (bench_function_base.Sort.time_argsort--)...
[ 50.00%] ··· Running (bench_function_base.Sort.time_sort_worst--).
[ 62.50%] ··· bench_function_base.Sort.time_argsort                           ok
[ 62.50%] ··· ======= ======= ================================ =============
                kind   dtype             array_type
              ------- ------- -------------------------------- -------------
               merge   int16            ('random',)               883±3μs
               merge   int16            ('ordered',)             16.2±0.1μs
               merge   int16           ('reversed',)             21.4±0.5μs
               merge   int16            ('uniform',)             15.8±0.1μs
               merge   int16        ('sorted_block', 10)          207±4μs
               merge   int16       ('sorted_block', 100)         146±0.4μs
               merge   int16       ('sorted_block', 1000)        78.1±0.9μs
               merge   int16       ('swapped_pair', 0.01)        133±0.4μs
               merge   int16       ('swapped_pair', 0.1)          364±5μs
               merge   int16       ('swapped_pair', 0.5)          727±3μs
               merge   int16   ('random_unsorted_area', 0.5)      172±2μs
               merge   int16   ('random_unsorted_area', 0.1)     35.6±0.1μs
               merge   int16   ('random_unsorted_area', 0.01)   18.4±0.04μs
               merge   int16        ('random_bubble', 1)         17.8±0.3μs
               merge   int16        ('random_bubble', 5)         24.7±0.2μs
               merge   int16       ('random_bubble', 10)         36.1±0.2μs
               merge   int32            ('random',)               885±20μs
               merge   int32            ('ordered',)            15.9±0.05μs
               merge   int32           ('reversed',)             21.8±0.4μs
               merge   int32            ('uniform',)             16.4±0.4μs
               merge   int32        ('sorted_block', 10)          207±7μs
               merge   int32       ('sorted_block', 100)          134±1μs
               merge   int32       ('sorted_block', 1000)        74.5±0.5μs
               merge   int32       ('swapped_pair', 0.01)         140±5μs
               merge   int32       ('swapped_pair', 0.1)          366±5μs
               merge   int32       ('swapped_pair', 0.5)          733±5μs
               merge   int32   ('random_unsorted_area', 0.5)      177±3μs
               merge   int32   ('random_unsorted_area', 0.1)      41.1±1μs
               merge   int32   ('random_unsorted_area', 0.01)    18.6±0.2μs
               merge   int32        ('random_bubble', 1)         18.3±0.3μs
               merge   int32        ('random_bubble', 5)         26.7±0.1μs
               merge   int32       ('random_bubble', 10)         41.4±0.9μs
               merge   int64            ('random',)               877±5μs
               merge   int64            ('ordered',)            16.6±0.09μs
               merge   int64           ('reversed',)             21.5±0.8μs
               merge   int64            ('uniform',)             16.7±0.3μs
               merge   int64        ('sorted_block', 10)          202±4μs
               merge   int64       ('sorted_block', 100)         134±0.9μs
               merge   int64       ('sorted_block', 1000)         72.7±1μs
               merge   int64       ('swapped_pair', 0.01)         138±2μs
               merge   int64       ('swapped_pair', 0.1)          366±7μs
               merge   int64       ('swapped_pair', 0.5)          773±8μs
               merge   int64   ('random_unsorted_area', 0.5)      174±2μs
               merge   int64   ('random_unsorted_area', 0.1)     40.0±0.8μs
               merge   int64   ('random_unsorted_area', 0.01)   18.1±0.08μs
               merge   int64        ('random_bubble', 1)         18.8±0.3μs
               merge   int64        ('random_bubble', 5)         24.7±0.4μs
               merge   int64       ('random_bubble', 10)         40.4±0.8μs
              ======= ======= ================================ =============

[ 75.00%] ··· bench_function_base.Sort.time_sort                              ok
[ 75.00%] ··· ======= ======= ================================ =============
                kind   dtype             array_type
              ------- ------- -------------------------------- -------------
               merge   int16            ('random',)               502±6μs
               merge   int16            ('ordered',)             14.5±0.2μs
               merge   int16           ('reversed',)            18.8±0.05μs
               merge   int16            ('uniform',)            14.4±0.06μs
               merge   int16        ('sorted_block', 10)          337±6μs
               merge   int16       ('sorted_block', 100)          255±5μs
               merge   int16       ('sorted_block', 1000)        141±0.8μs
               merge   int16       ('swapped_pair', 0.01)         256±2μs
               merge   int16       ('swapped_pair', 0.1)          390±2μs
               merge   int16       ('swapped_pair', 0.5)          490±8μs
               merge   int16   ('random_unsorted_area', 0.5)      136±1μs
               merge   int16   ('random_unsorted_area', 0.1)     36.2±0.6μs
               merge   int16   ('random_unsorted_area', 0.01)    17.1±0.2μs
               merge   int16        ('random_bubble', 1)         16.8±0.3μs
               merge   int16        ('random_bubble', 5)         25.0±0.5μs
               merge   int16       ('random_bubble', 10)         35.4±0.5μs
               merge   int32            ('random',)               515±10μs
               merge   int32            ('ordered',)            13.8±0.09μs
               merge   int32           ('reversed',)             20.3±0.2μs
               merge   int32            ('uniform',)             14.6±0.4μs
               merge   int32        ('sorted_block', 10)          341±2μs
               merge   int32       ('sorted_block', 100)          258±7μs
               merge   int32       ('sorted_block', 1000)        145±0.8μs
               merge   int32       ('swapped_pair', 0.01)         259±2μs
               merge   int32       ('swapped_pair', 0.1)          390±10μs
               merge   int32       ('swapped_pair', 0.5)          497±5μs
               merge   int32   ('random_unsorted_area', 0.5)     140±0.2μs
               merge   int32   ('random_unsorted_area', 0.1)     35.5±0.3μs
               merge   int32   ('random_unsorted_area', 0.01)    16.5±0.2μs
               merge   int32        ('random_bubble', 1)        16.0±0.09μs
               merge   int32        ('random_bubble', 5)         25.5±0.6μs
               merge   int32       ('random_bubble', 10)         36.4±0.2μs
               merge   int64            ('random',)               526±10μs
               merge   int64            ('ordered',)             17.6±0.3μs
               merge   int64           ('reversed',)            21.9±0.09μs
               merge   int64            ('uniform',)             17.4±0.2μs
               merge   int64        ('sorted_block', 10)          321±3μs
               merge   int64       ('sorted_block', 100)          262±4μs
               merge   int64       ('sorted_block', 1000)         149±4μs
               merge   int64       ('swapped_pair', 0.01)         275±5μs
               merge   int64       ('swapped_pair', 0.1)          388±4μs
               merge   int64       ('swapped_pair', 0.5)          505±6μs
               merge   int64   ('random_unsorted_area', 0.5)      142±1μs
               merge   int64   ('random_unsorted_area', 0.1)     40.4±0.5μs
               merge   int64   ('random_unsorted_area', 0.01)    19.3±0.5μs
               merge   int64        ('random_bubble', 1)         19.3±0.2μs
               merge   int64        ('random_bubble', 5)         27.3±0.2μs
               merge   int64       ('random_bubble', 10)         40.9±0.3μs
              ======= ======= ================================ =============

[ 87.50%] ··· bench_function_base.Sort.time_sort_inplace                      ok
[ 87.50%] ··· ======= ======= ================================ =============
                kind   dtype             array_type
              ------- ------- -------------------------------- -------------
               merge   int16            ('random',)              11.0±0.1μs
               merge   int16            ('ordered',)             10.7±0.2μs
               merge   int16           ('reversed',)             10.5±0.2μs
               merge   int16            ('uniform',)            10.6±0.08μs
               merge   int16        ('sorted_block', 10)         10.7±0.1μs
               merge   int16       ('sorted_block', 100)         10.8±0.3μs
               merge   int16       ('sorted_block', 1000)       10.3±0.05μs
               merge   int16       ('swapped_pair', 0.01)       10.9±0.08μs
               merge   int16       ('swapped_pair', 0.1)        10.6±0.08μs
               merge   int16       ('swapped_pair', 0.5)        10.8±0.08μs
               merge   int16   ('random_unsorted_area', 0.5)    10.3±0.09μs
               merge   int16   ('random_unsorted_area', 0.1)     10.5±0.2μs
               merge   int16   ('random_unsorted_area', 0.01)    10.4±0.2μs
               merge   int16        ('random_bubble', 1)        10.6±0.05μs
               merge   int16        ('random_bubble', 5)        10.5±0.08μs
               merge   int16       ('random_bubble', 10)         10.3±0.2μs
               merge   int32            ('random',)             9.64±0.04μs
               merge   int32            ('ordered',)             9.55±0.5μs
               merge   int32           ('reversed',)             9.72±0.5μs
               merge   int32            ('uniform',)            9.17±0.06μs
               merge   int32        ('sorted_block', 10)        9.92±0.08μs
               merge   int32       ('sorted_block', 100)         9.95±0.5μs
               merge   int32       ('sorted_block', 1000)        9.53±0.2μs
               merge   int32       ('swapped_pair', 0.01)       9.58±0.06μs
               merge   int32       ('swapped_pair', 0.1)        10.1±0.08μs
               merge   int32       ('swapped_pair', 0.5)        9.66±0.06μs
               merge   int32   ('random_unsorted_area', 0.5)     9.71±0.1μs
               merge   int32   ('random_unsorted_area', 0.1)    9.20±0.05μs
               merge   int32   ('random_unsorted_area', 0.01)   9.62±0.03μs
               merge   int32        ('random_bubble', 1)         9.63±0.5μs
               merge   int32        ('random_bubble', 5)        9.65±0.06μs
               merge   int32       ('random_bubble', 10)         9.23±0.1μs
               merge   int64            ('random',)             11.4±0.03μs
               merge   int64            ('ordered',)            10.4±0.06μs
               merge   int64           ('reversed',)             10.6±0.4μs
               merge   int64            ('uniform',)            10.3±0.02μs
               merge   int64        ('sorted_block', 10)         10.8±0.2μs
               merge   int64       ('sorted_block', 100)         11.0±0.2μs
               merge   int64       ('sorted_block', 1000)        10.8±0.3μs
               merge   int64       ('swapped_pair', 0.01)        10.8±0.2μs
               merge   int64       ('swapped_pair', 0.1)         10.9±0.2μs
               merge   int64       ('swapped_pair', 0.5)        10.9±0.04μs
               merge   int64   ('random_unsorted_area', 0.5)     10.8±0.2μs
               merge   int64   ('random_unsorted_area', 0.1)     10.5±0.2μs
               merge   int64   ('random_unsorted_area', 0.01)   10.4±0.07μs
               merge   int64        ('random_bubble', 1)        10.3±0.08μs
               merge   int64        ('random_bubble', 5)         10.4±0.1μs
               merge   int64       ('random_bubble', 10)        10.8±0.09μs
              ======= ======= ================================ =============

[100.00%] ··· bench_function_base.Sort.time_sort_worst                  99.6±2ms

@shoyer
Copy link
Member

shoyer commented Apr 9, 2019

My instinct relies on the latter, as I think stable sorts are usually used to sort in succession (or preserve some kind of partial order), and so do have sorted areas.

I also think this is most likely, so the best default is probably timsort.

That said, I do really like the idea of having an option for radix sort. It's too bad that this seems to break C API compatibility (I don't really understand the nuance of that yet).

@charris
Copy link
Member

charris commented Apr 9, 2019

We might also want to test int8. I don't know how often people sort that type, but it would seem to offer the best chance of good performance for radixsort. We should also fix the problems with the PyArray_ArrFuncs structure being setup by downstream users, which is why we cannot just add entries to the end like we do for other structures. See the discussion at #12945.

@shoyer
Copy link
Member

shoyer commented Apr 9, 2019

It looks like there are already some compelling speedups, e.g., by up to 10x for random int16 data. I don't think we need to test it on random int8 data to believe this would speed it up.

@hameerabbasi
Copy link
Contributor Author

not for this I'd say, however you can turn it around too - when we eventually break the ABI then that's a good moment to add radixsort

That's what I meant... I couldn't articulate it too well.

Becnhmarks (just for int8):

This PR:

python runtests.py --bench sort
Building, see build.log...
Build OK
· No executable found for python 3.6
· Discovering benchmarks
· Running 4 total benchmarks (1 commits * 1 environments * 4 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_hameerabbasi_anaconda_envs_numpy_bin_python
[ 12.50%] ··· Running (bench_function_base.Sort.time_argsort--)...
[ 50.00%] ··· Running (bench_function_base.Sort.time_sort_worst--).
[ 62.50%] ··· bench_function_base.Sort.time_argsort                           ok
[ 62.50%] ··· ======= ======= ================================ =============
                kind   dtype             array_type
              ------- ------- -------------------------------- -------------
               merge    int8            ('random',)              7.66±0.7ms
               merge    int8            ('ordered',)            6.54±0.08ms
               merge    int8           ('reversed',)             6.77±0.4ms
               merge    int8            ('uniform',)             1.86±0.1ms
               merge    int8        ('sorted_block', 10)         8.39±0.7ms
               merge    int8       ('sorted_block', 100)         7.09±0.8ms
               merge    int8       ('sorted_block', 1000)        7.26±0.9ms
               merge    int8       ('swapped_pair', 0.01)        7.00±0.3ms
               merge    int8       ('swapped_pair', 0.1)         7.61±0.6ms
               merge    int8       ('swapped_pair', 0.5)         7.44±0.4ms
               merge    int8   ('random_unsorted_area', 0.5)     7.75±0.4ms
               merge    int8   ('random_unsorted_area', 0.1)     7.39±0.2ms
               merge    int8   ('random_unsorted_area', 0.01)    6.64±0.1ms
               merge    int8        ('random_bubble', 1)         6.82±0.3ms
               merge    int8        ('random_bubble', 5)        6.46±0.07ms
               merge    int8       ('random_bubble', 10)        6.56±0.07ms
              ======= ======= ================================ =============

[ 75.00%] ··· bench_function_base.Sort.time_sort                              ok
[ 75.00%] ··· ======= ======= ================================ =============
                kind   dtype             array_type
              ------- ------- -------------------------------- -------------
               merge    int8            ('random',)             2.51±0.04ms
               merge    int8            ('ordered',)            2.41±0.08ms
               merge    int8           ('reversed',)             2.38±0.1ms
               merge    int8            ('uniform',)              774±40μs
               merge    int8        ('sorted_block', 10)        2.39±0.08ms
               merge    int8       ('sorted_block', 100)        2.36±0.04ms
               merge    int8       ('sorted_block', 1000)       2.30±0.07ms
               merge    int8       ('swapped_pair', 0.01)        2.51±0.1ms
               merge    int8       ('swapped_pair', 0.1)        2.51±0.07ms
               merge    int8       ('swapped_pair', 0.5)        2.55±0.06ms
               merge    int8   ('random_unsorted_area', 0.5)    2.60±0.05ms
               merge    int8   ('random_unsorted_area', 0.1)    2.64±0.06ms
               merge    int8   ('random_unsorted_area', 0.01)   2.56±0.06ms
               merge    int8        ('random_bubble', 1)         2.49±0.1ms
               merge    int8        ('random_bubble', 5)        2.52±0.01ms
               merge    int8       ('random_bubble', 10)        2.53±0.02ms
              ======= ======= ================================ =============

[ 87.50%] ··· bench_function_base.Sort.time_sort_inplace                      ok
[ 87.50%] ··· ======= ======= ================================ ==========
                kind   dtype             array_type
              ------- ------- -------------------------------- ----------
               merge    int8            ('random',)             830±9μs
               merge    int8            ('ordered',)            815±8μs
               merge    int8           ('reversed',)            814±20μs
               merge    int8            ('uniform',)            697±9μs
               merge    int8        ('sorted_block', 10)        816±7μs
               merge    int8       ('sorted_block', 100)        810±10μs
               merge    int8       ('sorted_block', 1000)       790±5μs
               merge    int8       ('swapped_pair', 0.01)       831±20μs
               merge    int8       ('swapped_pair', 0.1)        820±9μs
               merge    int8       ('swapped_pair', 0.5)        823±8μs
               merge    int8   ('random_unsorted_area', 0.5)    815±3μs
               merge    int8   ('random_unsorted_area', 0.1)    807±5μs
               merge    int8   ('random_unsorted_area', 0.01)   805±9μs
               merge    int8        ('random_bubble', 1)        824±20μs
               merge    int8        ('random_bubble', 5)        836±6μs
               merge    int8       ('random_bubble', 10)        829±10μs
              ======= ======= ================================ ==========

[100.00%] ··· bench_function_base.Sort.time_sort_worst                   102±3ms

master:

python runtests.py --bench sort
Building, see build.log...
Build OK
· No executable found for python 3.6
· Discovering benchmarks
· Running 4 total benchmarks (1 commits * 1 environments * 4 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_hameerabbasi_anaconda_envs_numpy_bin_python
[ 12.50%] ··· Running (bench_function_base.Sort.time_argsort--)..
[ 37.50%] ··· Running (bench_function_base.Sort.time_sort_inplace--)..
[ 62.50%] ··· bench_function_base.Sort.time_argsort                           ok
[ 62.50%] ··· ======= ======= ================================ =============
                kind   dtype             array_type
              ------- ------- -------------------------------- -------------
               merge    int8            ('random',)               94.6±2ms
               merge    int8            ('ordered',)              30.0±1ms
               merge    int8           ('reversed',)             30.2±0.6ms
               merge    int8            ('uniform',)            2.09±0.02ms
               merge    int8        ('sorted_block', 10)         64.8±0.8ms
               merge    int8       ('sorted_block', 100)         55.9±0.7ms
               merge    int8       ('sorted_block', 1000)        57.2±0.4ms
               merge    int8       ('swapped_pair', 0.01)        45.7±0.2ms
               merge    int8       ('swapped_pair', 0.1)          70.3±1ms
               merge    int8       ('swapped_pair', 0.5)          91.3±2ms
               merge    int8   ('random_unsorted_area', 0.5)     53.5±0.8ms
               merge    int8   ('random_unsorted_area', 0.1)     36.4±0.3ms
               merge    int8   ('random_unsorted_area', 0.01)    30.9±0.6ms
               merge    int8        ('random_bubble', 1)         30.3±0.6ms
               merge    int8        ('random_bubble', 5)          29.4±1ms
               merge    int8       ('random_bubble', 10)         30.4±0.7ms
              ======= ======= ================================ =============

[ 75.00%] ··· bench_function_base.Sort.time_sort                              ok
[ 75.00%] ··· ======= ======= ================================ =============
                kind   dtype             array_type
              ------- ------- -------------------------------- -------------
               merge    int8            ('random',)              83.2±0.5ms
               merge    int8            ('ordered',)             29.1±0.2ms
               merge    int8           ('reversed',)             29.3±0.2ms
               merge    int8            ('uniform',)            1.05±0.02ms
               merge    int8        ('sorted_block', 10)         62.1±0.3ms
               merge    int8       ('sorted_block', 100)         53.3±0.3ms
               merge    int8       ('sorted_block', 1000)        54.1±0.3ms
               merge    int8       ('swapped_pair', 0.01)        41.2±0.2ms
               merge    int8       ('swapped_pair', 0.1)         59.8±0.7ms
               merge    int8       ('swapped_pair', 0.5)         80.8±0.3ms
               merge    int8   ('random_unsorted_area', 0.5)     48.1±0.4ms
               merge    int8   ('random_unsorted_area', 0.1)     34.8±0.3ms
               merge    int8   ('random_unsorted_area', 0.01)    30.2±0.2ms
               merge    int8        ('random_bubble', 1)         29.3±0.7ms
               merge    int8        ('random_bubble', 5)         29.4±0.5ms
               merge    int8       ('random_bubble', 10)         29.6±0.2ms
              ======= ======= ================================ =============

[ 87.50%] ··· bench_function_base.Sort.time_sort_inplace                      ok
[ 87.50%] ··· ======= ======= ================================ =============
                kind   dtype             array_type
              ------- ------- -------------------------------- -------------
               merge    int8            ('random',)             8.43±0.05ms
               merge    int8            ('ordered',)            3.40±0.05ms
               merge    int8           ('reversed',)            3.30±0.01ms
               merge    int8            ('uniform',)              899±6μs
               merge    int8        ('sorted_block', 10)        5.96±0.04ms
               merge    int8       ('sorted_block', 100)        5.27±0.04ms
               merge    int8       ('sorted_block', 1000)       5.76±0.01ms
               merge    int8       ('swapped_pair', 0.01)       4.28±0.03ms
               merge    int8       ('swapped_pair', 0.1)        5.79±0.04ms
               merge    int8       ('swapped_pair', 0.5)        7.50±0.05ms
               merge    int8   ('random_unsorted_area', 0.5)    4.86±0.03ms
               merge    int8   ('random_unsorted_area', 0.1)    3.72±0.04ms
               merge    int8   ('random_unsorted_area', 0.01)   3.31±0.01ms
               merge    int8        ('random_bubble', 1)        3.25±0.02ms
               merge    int8        ('random_bubble', 5)        3.24±0.03ms
               merge    int8       ('random_bubble', 10)        3.49±0.03ms
              ======= ======= ================================ =============

[100.00%] ··· bench_function_base.Sort.time_sort_worst                   100±1ms

So you're not far off. 🙂

@hameerabbasi
Copy link
Contributor Author

hameerabbasi commented Apr 9, 2019

That said, I do really like the idea of having an option for radix sort. It's too bad that this seems to break C API compatibility (I don't really understand the nuance of that yet).

Not the API, but rather the ABI. It doesn't mean people have to change their code, but that code compiled for older versions of NumPy will no longer work for newer ones. This is because the positioning of some function pointers changed in a struct.

@charris
Copy link
Member

charris commented Apr 9, 2019

Looks to me that a good case can be made for radix sort for at least some of the dtypes.

@hameerabbasi hameerabbasi force-pushed the radix-sort branch 2 times, most recently from d4824a0 to 512e7dd Compare May 11, 2019 04:40
@hameerabbasi
Copy link
Contributor Author

I have rebased, fixed the doc error/warning, and incorporated @charris's change about using this only for small dtypes.

Testing:
- np.int8/np.uint8
- np.int16/np.uint16
- np.int32/np.uint32
- np.int64/np.uint64
- np.float32/np.float64
- np.float16/np.longdouble
@charris
Copy link
Member

charris commented May 12, 2019

close/reopen

@charris charris closed this May 12, 2019
@charris charris reopened this May 12, 2019
@charris charris merged commit 3bdc915 into numpy:master May 12, 2019
@charris
Copy link
Member

charris commented May 12, 2019

Thanks Hameer.

@jakirkham
Copy link
Contributor

Thanks for working on this Hameer. Glad to see it in.

I missed the last part about limiting radix sort to some smaller types, which types is it being used for?

@mattip
Copy link
Member

mattip commented May 12, 2019

It is this line which will choose radix sort for the first 5 types: npy_bool, npy_byte, npy_ubyte, npy_short, npy_ushort

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

argsort/sort with O(n) algorithms: counting-sort, radix-sort
10 participants