
sort should have an algorithm with 'best' complexity for nearly sorted data #12186

Closed
jondo opened this issue Oct 16, 2018 · 19 comments

@jondo

jondo commented Oct 16, 2018

I have got a large amount of data that is nearly sorted.
As e.g. nicely shown in these animations, a simple insertion sort could be the fastest in this case.

So I suggest offering this as an additional algorithm for the 'kind' argument.

(I also found a corresponding SO question.)
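For context, a minimal sketch of the existing `kind` argument that this suggestion would extend (the perturbation here is just an illustrative example of "nearly sorted" data):

```python
import numpy as np

# Nearly sorted data: a sorted array with one small unsorted region.
a = np.arange(100_000, dtype=np.int64)
a[10:20] = a[10:20][::-1]

# The kinds NumPy offered at the time; the proposal is to add an
# insertion-sort-style kind tuned for nearly sorted input.
for kind in ("quicksort", "mergesort", "heapsort"):
    out = np.sort(a, kind=kind)
    assert np.all(out[:-1] <= out[1:])
```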

@charris
Member

charris commented Oct 16, 2018

Note that there are already insertion sort implementations in npysort; they are used for "small" sorts. It should not be too hard to take that code and turn it into an official sorting method.

@liwt31
Contributor

liwt31 commented Oct 24, 2018

Can I take this on? This is a somewhat big enhancement; tasks include:

  • Implement an insertion sort in c.src
  • Update npy_sort.h and maybe use a template to generate the file
  • Add a new field in PyArray_ArrFunc and update arraytypes.c
  • Update ndarraytypes.h and conversion_utils.c for the new method.
  • Add tests and modify document

EDIT: it's not necessary to modify PyArray_ArrFunc directly; some tweaks around the field PyArray_SortFunc *sort[NPY_NSORTS] should be enough.

@mattip
Member

mattip commented Oct 24, 2018

Why do you need to touch PyArray_ArrFunc? Isn't it enough to define a new [NPY_SORTKIND] and increment NPY_SORTS?

If you feel the PR is very large, you can open a pull request with a partial implementation, mark it WIP, and add a checklist so we know when/what to review.

@charris
Member

charris commented Oct 24, 2018

This may be a bit much for a first project, since it involves a lot of code. You are, of course, welcome to give it a try.

@liwt31
Contributor

liwt31 commented Oct 24, 2018

Why do you need to touch PyArray_ArrFunc? Isn't it enough to define a new [NPY_SORTKIND] and increment NPY_SORTS?

Sorry, that was an inaccurate expression. This is what I meant originally.

This may a bit much for a first project involving a lot of code.

I admit solving the issue is quite challenging, so thank you all for your (forthcoming) kind advice.

@juliantaylor
Contributor

I'd start with some benchmarks first. Prepare some datasets and just replace the quicksort with an insertion sort locally for the benchmarks.
Adding a sorting algorithm breaks the ABI, so we should know some real numbers on the benefits before doing so.

@charris
Member

charris commented Oct 25, 2018

Quicksort falls back on insertion sort when the partition size is small, so all you need to do to test the timing is to change the cutoff to a bigger number in numpy/core/src/npysort/quicksort.c.src.

#define SMALL_QUICKSORT 15 /* change this to bigger number */

Quicksort will then be insertion sort for arrays smaller than that. The problem will be putting together relevant test data and deciding what "almost sorted" means. Note that writing insertion sort comes down to copying the quicksort code and deleting large chunks of it :)
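For reference, a pure-Python sketch of the insertion sort under discussion (the C version in npysort has the same structure; this illustrates why it is near-linear on nearly sorted input):

```python
def insertion_sort(a):
    """In-place insertion sort. The total number of element moves is
    O(n + k), where k is the number of inversions, so it runs in
    near-linear time when the input is already almost sorted."""
    for i in range(1, len(a)):
        v = a[i]
        j = i - 1
        # Shift larger elements right until v's slot is found.
        while j >= 0 and a[j] > v:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = v
    return a

assert insertion_sort([3, 1, 2, 2]) == [1, 2, 2, 3]
```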

@charris
Member

charris commented Oct 25, 2018

It might be possible to also speed up mergesort for the almost-sorted case. Which, now that I think of it, is what Timsort does. Maybe we should finally implement that; it seems safer than a straight insertion sort, and we have discussed adding it before.

@charris
Member

charris commented Oct 25, 2018

If Timsort is no worse than mergesort, we could simply replace mergesort under the covers, just like we did with quicksort, which is now introsort.

@liwt31
Contributor

liwt31 commented Oct 26, 2018

I found a lib to do some quick benchmarks; here are some results (array size 100000, averaged over 10 runs, -O3 optimization).

Random generated array

Running tests
stdlib qsort time:                 15712.00 us per iteration
quick sort time:                    7235.20 us per iteration
selection sort time:             16801795.10 us per iteration
bubble sort time:                17992590.90 us per iteration
merge sort time:                    8862.10 us per iteration
binary insertion sort time:      1465175.50 us per iteration
heap sort time:                    10233.90 us per iteration
shell sort time:                   11486.70 us per iteration
tim sort time:                      9248.40 us per iteration
in-place merge sort time:           8205.80 us per iteration
grail sort time:                   10529.20 us per iteration
sqrt sort time:                     9159.90 us per iteration
rec stable sort sort time:         34215.50 us per iteration
grail sort dyn buffer sort time:    9527.50 us per iteration

It makes sense that quick sort is the fastest. Tim sort is slightly worse than merge sort.

Completely sorted array

Running tests
stdlib qsort time:                  5419.30 us per iteration
quick sort time:                    1924.90 us per iteration
selection sort time:             3391344.20 us per iteration
bubble sort time:                     96.00 us per iteration
merge sort time:                    2016.40 us per iteration
binary insertion sort time:           67.00 us per iteration
heap sort time:                     8213.60 us per iteration
shell sort time:                    1480.10 us per iteration
tim sort time:                        53.90 us per iteration
in-place merge sort time:            154.90 us per iteration
grail sort time:                    2208.90 us per iteration
sqrt sort time:                     1689.30 us per iteration
rec stable sort sort time:          2118.00 us per iteration
grail sort dyn buffer sort time:    1749.30 us per iteration

Insertion sort is good, but the fastest is tim sort. Very impressive.

Almost sorted array with 0.05% permutation (each element is away from its sorted position by, on average, 0.05% of the total size of the array, i.e. 50 elements)

Running tests
stdlib qsort time:                  9476.00 us per iteration
quick sort time:                    5275.50 us per iteration
selection sort time:             3496961.80 us per iteration
bubble sort time:                  21221.50 us per iteration
merge sort time:                    5251.80 us per iteration
binary insertion sort time:         5690.50 us per iteration
heap sort time:                     9306.70 us per iteration
shell sort time:                    4523.20 us per iteration
tim sort time:                      5101.90 us per iteration
in-place merge sort time:           4103.30 us per iteration
grail sort time:                    5359.30 us per iteration
sqrt sort time:                     4110.00 us per iteration
rec stable sort sort time:         10509.20 us per iteration
grail sort dyn buffer sort time:    4799.10 us per iteration

Tim sort is not performing well with this kind of permutation, and insertion sort performs basically the same as tim sort. In-place merge sort becomes the fastest.

Almost sorted array with 5% permutation (each element is away from its sorted position by, on average, 5% of the total size of the array, i.e. 5000 elements)

Running tests
stdlib qsort time:                 13776.50 us per iteration
quick sort time:                    7988.50 us per iteration
selection sort time:             4810939.30 us per iteration
bubble sort time:                1780895.10 us per iteration
merge sort time:                    7918.50 us per iteration
binary insertion sort time:       100704.50 us per iteration
heap sort time:                     8718.30 us per iteration
shell sort time:                    8775.20 us per iteration
tim sort time:                      7933.30 us per iteration
in-place merge sort time:           6256.20 us per iteration
grail sort time:                    8166.30 us per iteration
sqrt sort time:                     6699.00 us per iteration
rec stable sort sort time:         25762.10 us per iteration
grail sort dyn buffer sort time:    7437.60 us per iteration

Merge sort (in-place or not) is better than tim sort, and insertion sort becomes useless.

Conclusion

These experiments suggest that when the array is truly "nearly sorted", tim sort is the best choice, and when the array is only "somewhat nearly sorted", in-place merge sort is the best choice. I'd consider the latter to be more common in real cases, so if we decide to replace merge sort, we'd better replace it with in-place merge sort. And if necessary, we can add tim sort as another option for users who really know that their arrays are almost sorted.
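(The benchmarks above use a C library; a rough way to run a similar comparison from the Python side, with a hypothetical displacement scheme for "almost sorted" data, might look like this:)

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
base = np.sort(rng.integers(0, n, n))

# "Almost sorted": each element is displaced from its sorted position
# by roughly +/-50 slots, mimicking the 0.05% permutation above.
noise = rng.integers(-50, 51, n)
almost = base[np.argsort(np.arange(n) + noise, kind="stable")]

for kind in ("quicksort", "mergesort", "heapsort"):
    t = timeit.timeit(lambda: np.sort(almost, kind=kind), number=10)
    print(f"{kind:10s} {t / 10 * 1e3:8.2f} ms per iteration")
```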

@liwt31
Contributor

liwt31 commented Oct 26, 2018

While trying to improve these experiments with more kinds of permutation, it occurred to me that maybe what we should do is implement as many sorting algorithms as possible and leave the benchmarking, and the decision of which to use, to users. Some algorithms, such as tim sort, would of course have higher priority on the todo list.

@charris
Member

charris commented Oct 26, 2018

We want as few methods as will serve; otherwise a lot of methods just sit around unused. The original three were basically "in-place, quick, not guaranteed", "out of place, stable, guaranteed", and "in-place, guaranteed". With introsort replacing quicksort, we could probably drop heapsort at some point.

We cannot replace mergesort with in-place mergesort because the latter is not stable, and we need a stable sort. So I think it comes down to timsort, which seems generally as good as mergesort, or trying some compiler tricks to improve mergesort, maybe NPY_LIKELY in some if statements.
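(A small illustration of why stability matters here: when equal keys occur, a stable sort must keep them in their original order, which is what makes sorting by multiple keys in succession work. A sketch using NumPy's stable mergesort kind:)

```python
import numpy as np

# Sorting records by one key: entries with equal keys must keep
# their original relative order. NumPy guarantees this for
# kind='mergesort'.
keys = np.array([2, 1, 2, 1, 2])
idx = np.argsort(keys, kind="mergesort")  # stable

# The 1s (original indices 1, 3) and the 2s (0, 2, 4) each appear
# in their original index order.
assert list(idx) == [1, 3, 0, 2, 4]
```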

@charris
Member

charris commented Oct 26, 2018

There is recent ongoing work on stable sorts, see https://github.com/BonzaiThePenguin/WikiSort for an example.

@charris
Member

charris commented Oct 26, 2018

I think timsort also relies on runs, so it might work better on data where you shuffle (or even reverse) a randomly selected portion of sorted data, rather than displacing every element a small distance, so that the result has a lot of long runs in it.
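(A sketch of the kind of run-heavy test data described above: sorted data with a few randomly placed shuffled patches, leaving long sorted runs between them. The patch sizes and counts here are arbitrary illustrative choices:)

```python
import numpy as np

rng = np.random.default_rng(42)

def runs_data(n, n_patches=5, patch=100):
    """Sorted array with a few shuffled patches; long runs survive."""
    a = np.arange(n)
    for _ in range(n_patches):
        start = int(rng.integers(0, n - patch))
        rng.shuffle(a[start:start + patch])  # shuffles the view in place
    return a

def count_runs(a):
    """Number of maximal non-decreasing runs."""
    return int(np.sum(np.diff(a) < 0)) + 1

a = runs_data(10_000)
# Far fewer runs than fully random data (which would have ~n/2 runs).
assert count_runs(a) < 1000
```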

@jondo
Author

jondo commented Oct 26, 2018

In the use case that made me file this issue, the data has got long runs and only a few small unsorted areas.

@charris
Member

charris commented Oct 26, 2018

Useful google: measures of sortedness. I think the measure we might want is the "insertion index", n - length_of_longest_run.

@jondo The case of runs might be improved with a simple NPY_LIKELY or some such, which should allow parallel execution of the likely case in mergesort. @juliantaylor Thoughts?
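(A sketch of that measure, assuming "run" means a maximal non-decreasing run; the value is 0 for sorted input and grows as the array gets less sorted:)

```python
import numpy as np

def insertion_index(a):
    """n - length of the longest non-decreasing run, one of the
    'measures of sortedness' mentioned above."""
    a = np.asarray(a)
    if a.size == 0:
        return 0
    # Positions where a run is broken (a[i] > a[i+1]).
    desc = np.flatnonzero(np.diff(a) < 0)
    # Run boundaries, including the two ends of the array.
    bounds = np.concatenate(([-1], desc, [a.size - 1]))
    longest = int(np.max(np.diff(bounds)))
    return int(a.size) - longest

assert insertion_index([1, 2, 3, 4]) == 0   # fully sorted
assert insertion_index([4, 3, 2, 1]) == 3   # every run has length 1
```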

@charris
Member

charris commented Oct 26, 2018

Well, NPY_LIKELY sped things up by 0.5%, so not much help :(

@liwt31
Contributor

liwt31 commented Oct 27, 2018

I'd prefer tim sort. It is well studied, benchmarked, and tested in practice. After it's implemented, we can do some benchmarks to decide whether it should replace the old merge sort.

@shoyer
Member

shoyer commented May 10, 2019

Timsort has been merged!
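(If I remember correctly, the merged timsort is reachable via kind='stable', which dispatches to the best stable algorithm for the dtype; for floats that is timsort. A minimal usage sketch:)

```python
import numpy as np

a = np.array([3.0, 1.0, 2.0, 2.0])
out = np.sort(a, kind="stable")  # stable sort; timsort for float dtypes
assert np.array_equal(out, np.array([1.0, 2.0, 2.0, 3.0]))
```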

@shoyer shoyer closed this as completed May 10, 2019