sort should have an algorithm with 'best' complexity for nearly sorted data #12186
Note that there are already insertion sort implementations in the existing quicksort code.
Can I take this on? This is a somewhat big enhancement with several tasks involved.
EDIT: it's not necessary to modify …
Why do you need to touch it? If you feel the PR is very large, you can make a merge request of a partial implementation, mark it WIP, and add a checklist so we know when/what to review.
This may be a bit much for a first project, as it involves a lot of code. You are, of course, welcome to give it a try.
Sorry, that was an inaccurate expression. This is what I meant originally.
I admit solving the issue is quite challenging, so thank you all for your (forthcoming) kind advice.
I'd start with some benchmarks first. Prepare some datasets and just replace the quicksort with an insertion sort locally for the benchmarks.
Quicksort falls back on insertion sort when the partition size is small, so all you need to do to test the timing is change the cutoff to a bigger number in the quicksort source. Quicksort will then be insertion sort for arrays smaller than that. The problem will be putting together relevant test data and deciding what "almost sorted" means. Note that writing insertion sort comes down to copying the quicksort code and deleting large chunks of it :)
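For concreteness, here is a minimal sketch of the idea in plain Python (not NumPy's C implementation): insertion sort performs roughly one element move per inversion, so its cost is about n plus the number of out-of-order pairs, which is what makes it nearly linear on nearly sorted input.

```python
# Minimal insertion sort sketch (illustration only, not NumPy's C code).
# The number of element moves equals the number of inversions, so the
# total cost is ~n + inversions: nearly linear on nearly sorted input.
def insertion_sort(a):
    a = list(a)
    moves = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:   # shift larger elements right
            a[j + 1] = a[j]
            j -= 1
            moves += 1
        a[j + 1] = key
    return a, moves

nearly, m1 = insertion_sort([1, 2, 4, 3, 5, 6, 7, 8])   # one inversion
worst,  m2 = insertion_sort([8, 7, 6, 5, 4, 3, 2, 1])   # 28 inversions
```

On the nearly sorted input only one move is needed; on the reversed input it degrades to the full n(n-1)/2 moves.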
It might be possible to also speed up mergesort for the almost sorted case. Which, now that I think of it, is what timsort does.
If Timsort is no worse than mergesort, we could simply replace mergesort under the covers, just like we did with quicksort, which is now introsort. |
I found a lib to do some quick benchmarks; here are some results (array size 100000, averaged over 10 runs, -O3 optimization).

Randomly generated array: quicksort is the fastest, which makes sense; timsort is slightly worse than mergesort.

Completely sorted array: insertion sort is good, but the fastest is timsort. Very impressive.

Almost sorted array, 0.05% permutation (every element is away from its sorted position by, on average, 0.05% of the total array size, i.e. 50 elements): timsort does not perform well with this kind of permutation, and insertion sort performs basically the same as timsort. In-place mergesort becomes the fastest.

Almost sorted array, 5% permutation (average displacement 5% of the array size, i.e. 5000 elements): mergesort (in-place or not) beats timsort, and insertion sort becomes useless.

Conclusion: these experiments suggest that when the array is truly "nearly sorted", timsort is the best choice, and when it is only "somewhat nearly sorted", in-place mergesort is the best choice. I'd consider the latter to be more common in real cases, so if we decide to replace mergesort, we'd better replace it with in-place mergesort. If necessary we could also add a separate timsort for users who really know that their arrays are almost sorted.
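Since the benchmark tables themselves did not survive, here is a sketch of how such test data and timings can be produced. The perturbation model (random transpositions) is an assumption; the exact model used in the original benchmarks is not known.

```python
import numpy as np
from timeit import timeit

def nearly_sorted(n, frac_swaps, seed=0):
    """Sorted array perturbed by about n*frac_swaps random transpositions."""
    rng = np.random.default_rng(seed)
    a = np.arange(n)
    k = int(n * frac_swaps)
    for p, q in zip(rng.integers(0, n, k), rng.integers(0, n, k)):
        a[p], a[q] = a[q], a[p]          # swaps keep it a permutation
    return a

data = nearly_sorted(100_000, 0.0005)    # ~0.05% of elements displaced
for kind in ("quicksort", "mergesort", "stable"):
    t = timeit(lambda: np.sort(data, kind=kind), number=10)
    print(f"{kind:10s} {t:.4f}s")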
While trying to improve these experiments with more kinds of permutation, it came to me that maybe what we should do is implement as many sorting algorithms as possible and leave the benchmarking and the decision of which to use to the users. Some algorithms, such as timsort, may have higher priority in the todo list, of course.
We want as few methods as necessary, otherwise a lot of methods just sit around unused. The original three were basically "inplace, quick, not guaranteed", "out of place, stable, guaranteed", and "inplace, guaranteed". With introsort replacing quicksort, we could probably drop heapsort at some point. We cannot replace mergesort with in-place mergesort because the latter is not stable and we need a stable sort. So I think it comes down to timsort, which seems generally as good as mergesort, or trying some compiler tricks to improve mergesort.
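To illustrate why the stable slot cannot simply be filled by an unstable in-place mergesort: only a stable sort lets you build a multi-key sort out of successive single-key passes. A small sketch with hypothetical toy data:

```python
import numpy as np

names = np.array(["carol", "alice", "bob", "dave"])
dept = np.array([2, 1, 1, 2])

# Sort by the secondary key first, then stably by the primary key;
# stability preserves alphabetical order within each department.
order = np.argsort(names, kind="stable")
order = order[np.argsort(dept[order], kind="stable")]
result = list(zip(dept[order].tolist(), names[order].tolist()))
print(result)
```

An unstable second pass would be free to scramble the names inside each department, so the two-pass trick only works with a stability guarantee.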
There is recent ongoing work on stable sorts, see https://github.com/BonzaiThePenguin/WikiSort for an example. |
I think timsort also relies on runs, so it might work better for data where you shuffle (or even reverse) a randomly selected portion of sorted data rather than just keeping data nearby, so that the result has a lot of runs in it.
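Run-heavy test data along those lines could be generated like this (a sketch; block-shuffling is just one of several ways to control the number of runs):

```python
import numpy as np

def run_heavy(n, n_runs, seed=0):
    """Permute whole blocks of a sorted array; each block stays an
    ascending run, so the result has at most n_runs runs."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(np.arange(n), n_runs)
    return np.concatenate([blocks[i] for i in rng.permutation(n_runs)])

a = run_heavy(1000, 8)
runs = 1 + int(np.sum(np.diff(a) < 0))   # descents separate runs
```

This models the use case above: long sorted stretches with only a few boundaries where order breaks.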
In the use case that made me file this issue, the data has got long runs and only a few small unsorted areas. |
Useful google: "measures of sortedness". I think the measure we might want is "insertion index". @jondo The case of runs might be improved with a simple `NPY_LIKELY` or some such, which should allow parallel execution of the likely case in mergesort. @juliantaylor Thoughts?
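Two common measures of sortedness, sketched for concreteness ("insertion index" above presumably means something inversion-like; these particular definitions are my assumption, not an agreed metric from the thread):

```python
import numpy as np

def n_runs(a):
    """Number of maximal ascending runs; 1 means fully sorted."""
    return 1 + int(np.sum(np.diff(a) < 0))

def n_inversions(a):
    """Out-of-order pairs; 0 means fully sorted. O(n^2), illustration only."""
    a = np.asarray(a)
    return int(sum(int((a[i] > a[i + 1:]).sum()) for i in range(len(a))))
```

Insertion sort's cost tracks inversions, while timsort's cost tracks runs, which is why the two disagree on which inputs count as "nearly sorted".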
I'd prefer timsort. It is well studied, benchmarked, and tested in practice. After it's implemented we can run some benchmarks to decide whether it should replace the old mergesort or not.
Timsort has been merged! |
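For reference, after this merge the `np.sort` documentation states that the stable kind dispatches to timsort or radix sort depending on dtype, with `'mergesort'` kept as an alias for the stable option:

```python
import numpy as np

# kind="stable" selects timsort (or radix sort for some integer dtypes)
a = np.array([3.0, 1.0, 2.0, 1.0])
print(np.sort(a, kind="stable"))   # -> [1. 1. 2. 3.]
```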
I have got a large amount of data that is nearly sorted.
As nicely shown, e.g., in these animations, a simple insertion sort could be the fastest in this case.
So I suggest offering this as an additional algorithm for the 'kind' argument.
(I also found a corresponding SO question.)