
proposal: sort: Stable with better asymptotic time complexity #25137

Open
nsajko opened this issue Apr 27, 2018 · 19 comments

Comments


@nsajko nsajko commented Apr 27, 2018

The proposed change does not add any new API and is backwards compatible.

The current implementation of Stable is very simple, but not satisfactorily performant: it runs in O(n*log(n)*log(n)) time.

I propose implementing Stable with an O(n*log(n)) algorithm, which is, however, much more complex. See https://go-review.googlesource.com/c/go/+/101415

Code complexity: the following functions with noted cyclomatic complexities are added:

Cyclomatic complexity | Function
118 sort stable sort.go:918:1
19 sort merge sort.go:666:1
16 sort findBDSAndCountDistinctElementsNear sort.go:420:1
13 sort findBDSFarAndCountDistinctElements sort.go:497:1
8 sort simpleMergeBufBigSmall sort.go:625:1
4 sort distinctElementCount sort.go:389:1
3 sort extractDist sort.go:573:1
3 sort equalRange sort.go:562:1
3 sort max sort.go:333:1
3 sort lessBlocks sort.go:613:1
2 sort maxDepth sort.go:223:1
1 sort searchLess sort.go:328:1
1 sort isqrter sort.go:348:1

Performance:

Benchmark results using func bench (it sorts pseudorandom data) from sort_test.go (benchmark names include input array size):

Benchmark | Improvement
Stable1e4 10% (from 110 ms to 100 ms)
Stable1e6 31% (from 23 s to 16 s)
Stable1e7 36% (from 1.9 min to 1.2 min)
Stable1e8 43% (from 64 min to 36 min)

The code in master is faster for input arrays with very many identical elements, but those are fast to sort anyway (in both the old and the new code they are orders of magnitude faster to sort than random data).

@gopherbot gopherbot added this to the Proposal milestone Apr 27, 2018
@gopherbot gopherbot added the Proposal label Apr 27, 2018

@ianlancetaylor ianlancetaylor commented Apr 27, 2018

What are the effects on memory allocation? That is, is any extra memory allocated during the sort?


@nsajko nsajko commented Apr 27, 2018

What are the effects on memory allocation? That is, is any extra memory allocated during the sort?

There is an array of 5 * 256 uint8 (the constant is tunable if needed), and a logarithmic amount of stack space will be used if there are calls to symMerge or quickSort, because they are recursive (those two already exist in sort.go in master). Neither of those is essential to the algorithm; both were added as optimizations.


@ancaemanuel ancaemanuel commented Apr 27, 2018

Why would somebody need this over the current standard library implementation?

Is it better in the general random case? Numbers, please.
Is it better in some particular case? And why is that case important?


@nsajko nsajko commented Apr 28, 2018

It is better in general random case ? Numbers please.

Yeah, look under "Benchmark results" in my first post here; I do not know how that could be missed. For more details see the linked CL on Go's Gerrit.

EDIT: sorry, I failed to realize that not everybody is familiar with sort_test.go. The benchmark results posted here (produced by func bench) come from sorting pseudorandom data.


@nsajko nsajko commented Apr 30, 2018

It is interesting to note that, with gcc 7.3.1, my code beats the libstdc++ stable_sort limited to sublinear memory usage, starting from array sizes of about 1e8! I more or less ported func bench from sort_test.go to C++, and when compiled with c++ c++StableSortBench.cc the C++ program takes 38 minutes/2277 seconds (at 1.2 GHz CPU frequency) to sort the arrays as in func bench, while my Go code takes just 36 minutes/2189 seconds.

(To clarify, both func bench and my C++ "port" sort seven arrays filled with pseudorandom keys and sizes ranging from 1e8-3 to 1e8+3 inclusive. The durations are the sums from sorting all seven arrays.)

I guess that means libstdc++ also uses an O(n*log(n)*log(n)) algorithm ...

NB: when compiled with compiler optimizations turned on, the C++ code is 4.7 times faster than when they are turned off (which is the default); thus much larger array sizes would be needed for my Go code to beat that.

See the C++ code here: https://github.com/nsajko/cxxStableSortBench


@ianlancetaylor ianlancetaylor commented Apr 30, 2018

Is this the same as the suggested algorithm in #7610?


@nsajko nsajko commented Apr 30, 2018

Is this the same as the suggested algorithm in #7610?

Yes and no.

#7610 proposes using the 2008 merging algorithm by Kim and Kutzner which is the basis of my code. On the other hand #7610 then links to Wikisort, which is not the same as my sort; and in fact uses a more than slightly modified version of the merging algorithm from Kim and Kutzner.


@griesemer griesemer commented Apr 30, 2018

I'm copying from the CL description:

The new implementation is a merge sort with the merge based on Kim, P.S.
and Kutzner, A., 2008, April. Ratio based stable in-place merging. In
International Conference on Theory and Applications of Models of
Computation (pp. 246-257). Springer, Berlin, Heidelberg.

Having an asymptotically better stable sort seems like a good thing to have. At the same time, there is a significant amount of non-trivial new code (~1000 lines plus generated code) that needs thorough vetting.

@nsajko Is there anyone besides you who could help validate/review your implementation and confirm it matches the above paper? Ideally we need another expert in sorting algorithms. Perhaps @vdobler ?


@nsajko nsajko commented May 1, 2018

@nsajko Is there anyone besides you that could help validate/review your implementation and confirm its matching the above paper? Ideally we need another expert in sorting algorithms. Perhaps @vdobler ?

Sadly, I do not know anybody who could help. It would certainly be appreciated if @vdobler or anybody else could review the code.


@ancaemanuel ancaemanuel commented May 1, 2018

OK. The complexity of the algorithm looks good, and the cited papers increase confidence that you are doing a great job, but is the amount of code added really worth it?
I think in some cases it will be. But it is not my decision.


@vdobler vdobler commented May 2, 2018

I'll try to review it, but it is really difficult:

  • The main merging loop is simply too complex and has too many live variables with overly cryptic names that differ from the paper.
  • The implementation is not only the implementation of StableOptimalBlockMerge from Kim&Kutzner but contains several (presumably necessary) optimizations.
  • The current CL uses SymMerge instead of Hwang&Lin, which makes the code much simpler, but I think this invalidates the complexity analysis of the paper, or at least it needs work to be re-proved.

The first two points might pose a fundamental problem:
a CL including all the necessary optimizations will be un-reviewable, while a straight (and thus digestible) implementation of the algorithm will not show much (if any!) performance gain and thus won't get merged.

The current code of https://go-review.googlesource.com/c/go/+/101415/41
is neither maintainable nor production ready. I'm very interested in having an
asymptotically better Stable in the standard library, and I'm going to invest
significant time in understanding Kim&Kutzner's paper as well as the implementation.


@griesemer griesemer commented May 2, 2018

@vdobler Thank you for offering to review!

One option to make progress, perhaps starting with a straightforward implementation of the algorithm that can be more easily reviewed and vetted as correct, would be to have this as a second implementation alongside the current one, with an appropriately documented (constant) flag to enable/disable it. This would permit checking in code that is disabled by default but can nevertheless be reviewed and tested on individual machines simply by enabling the flag. Eventually, we would be able to flip the flag.

But before we go there I'd like to bring this up in the proposal review committee and solicit other opinions. It's not clear to me that this should be a proposal in the first place, for one.

Either way, this is certainly not for 1.11 (maybe for 1.12) so there is quite a bit of time here.


@nsajko nsajko commented May 2, 2018

Thanks for your criticisms @vdobler, I will try to address just some of your points for now.

Before addressing your points: I just realized that I seemingly failed to mention one difference between the Kim&Kutzner paper and my code. K&K put the BDS and MIB on the low-index edges of arrays, while I put them on the high-index edges; with that come other changes, like (in func merge) rolling the subblocks of the high-index block instead of the low-index block, and a lot of index variables pointing to high-index edges where they point to low-index edges in the paper. I will document that in the next PS.

The main merging loop is simply too complex, has too many live variables with too cryptic names differing from the paper.

Do you mean the "Local merges" part in func merge or the entirety of func merge?

The implementation is not only the implementation of StableOptimalBlockMerge from Kim&Kutzner but contains several (presumably necessary) optimizations.

Presumably you mean the "outOfInputData" MIB and BDS which cause func merge to take two Interface parameters?

The current CL uses SymMerge instead if Hwang&Lin which makes the code much simpler but I think this makes the complexity analysis of the paper invalid, or at least needs work to re-prove it.

In general symMerge is (for sorted arrays u and v with m == |u| == |v|) O(m * log(m)), while Hwang&Lin (the version based on rotations, without a workspace buffer) is O(m * m) in general, with the "extended" version described in the paper being O(λ * m) (see Lemma 3 in the paper), where λ is the number of distinct elements in one of u or v. But AFAIU λ equals m-1 in the worst case when we use the algorithm for local merges (in func merge), making it O(m * m). Since O(m * m) asymptotically dominates O(m * log(m)), using symMerge instead of Hwang&Lin does not invalidate the complexity analysis. I hope I did not get something wrong.

(Sidenote: please take a look at Theorem 1 in the paper. It proves that the merge takes O(n) assignments, but the reasoning for the case where the rotation-based variant of Hwang&Lin should be used for local merges is left to the reader, with a pointer to Lemma 3. To be honest, it seems to me that the proof does not even work for that case, i.e. that the merge is O(n * sqrt(n)) in that case instead of O(n). So PTAL at the end of Theorem 1.)

A CL including all the necessary optimizations will be un-reviewable while a straight implementation of the algorithm (and thus digestible) will not show much (if at all!) performance gain and thus won't get merged.

Actually, I think the outOfInputData-buffer optimization should not impact the performance of sorting large arrays (like 1e8). But I do not think removing that would simplify the code very much.


@vdobler vdobler commented May 2, 2018

@nsajko

K&K put the BDS and MIB on low-index edges of arrays, while I put them on high-index edges

Why? Is there a technical reason? The algorithm is complex and not widely known
and the diagrams from the paper are helpful but only if they do match the
code.

Do you mean the "Local merges" part in func merge or the entirety of func merge?

The whole and any part of it. Let me explain what I would consider reviewable:

  • Merge calling a function stableOptimalBlockMerge which implements Stable-Optimal-Block-Merge from the paper.
  • stableOptimalBlockMerge calling functions which each implements one of the five steps of Stable-Optimal-Block-Merge from the paper.
  • Grouping all the MIB-related stuff (wherever it is located, be it part of the data or "outOfInput") into a struct with appropriate methods.
  • Same for the BDS: one BDS struct instead of dozens of integer variables.

Presumably you mean the "outOfInputData" MIB and BDS which cause func merge to take two Interface parameters?

Yes, and every comment of the form "as an optimization...".
But this is not the place for detailed code review.
Please understand that I do have problems reviewing the code in its current form.

We need to agree on nomenclature. I think your comment is based on the specialization
of the paper to m = n, i.e. k = 1, and the big-Os you give are for the number of assignments?
Doing the math will take me some time.

Actually, I think the outOfInputData-buffer optimization should not impact the performance of sorting large arrays (like 1e8). But I do not think removing that would simplify the code very much.

While reading the paper and your code I thought it might make the code
much simpler if there were an (opaque) movementImitationBuffer, with whether
it works out-of-input or as an MIB inside the data being an internal detail.
In a first, simplified algorithm one could always use a dynamically allocated
out-of-input MIB, add a data-based MIB in a second step, and a mixture of
data-based and out-of-input MIB in a third step. All nicely encapsulated in
a type hiding the details.


@vdobler vdobler commented May 3, 2018

@griesemer That seems like a very reasonable plan.
I'm curious about the review committee's opinion.


@nsajko nsajko commented May 3, 2018

@vdobler

K&K put the BDS and MIB on low-index edges of arrays, while I put them on high-index edges

Why? Is there a technical reason?

The natural way to implement a merge sort is to start from the low indices, possibly leaving an undersized block at the high-index end. The reason is to merge this undersized block more efficiently.

  • Merge calling a function stableOptimalBlockMerge which implements Stable-Optimal-Block-Merge from the paper.
  • stableOptimalBlockMerge calling functions which each implements one of the five steps of Stable-Optimal-Block-Merge from the paper.

func merge only implements steps 3 (Block rearrangement) and 4 (Local merges), with the corresponding code blocks marked by comments. I could separate most of the code of func merge into two separate functions if you really want that, but I think it is simpler as it is now, because the integer variables can be used throughout the code instead of having to be recomputed.

  • Grouping everything MIB related stuff (were it is located, be it as part of data or "outOfInput") into a struct with appropriate methods.
  • Same for the BDS. One BDS struct instead of dozens of integer variables.

But the paper also uses integer variables ...

While reading the paper and your code I thought it might make the code
much simpler if there where a (opaque) movementImitationBuffer which does
its work with out-of-input or MIB as part of data being an internal detail.

But that is what we already have in my code: outOfInputDataMIBuffer implements Interface, and func merge never knows whether aux Interface is the internal MIB and BDS or the outOfInputData variants.


@vdobler vdobler commented May 3, 2018

@nsajko

The reason is to merge this undersized block more efficiently.

That's a fine explanation and a valid reason. Now this must find its way into
the description of the algorithm.

For the rest: it seems I am totally inept at explaining the problem with
the code of patchset 41. Let me try to be very clear: it doesn't matter
whether you find the code easy, simple, efficient, and handy if I don't understand
it during review. Of course the paper uses integer indices, but there
are literally dozens of variables like bds0, bds1, backupBDS0, backupBDS1,
buf, bufLastDiEl, and of course I could make an educated guess at what
bds0 and bds1 are and check whether the code matches my guess, but that is
simply not how code should look.

If we follow Robert's idea of iteratively building up the new algorithm
in small, reviewable steps, you will have to start over and come up
with code digestible for the people reviewing it.

But that is what we already have in my code: outOfInputDataMIBuffer implements Interface and func merge does not ever know if aux Interface is internal MIB and BDS, or the outOfInputData variants.

Only partly: aux cannot live alone and does not abstract the whole
"keep track of what we have to undo later" concern: all the indices have to be
passed around and manipulated. The aux idea is clever, and I think it might
be a good idea to take it further so that the code becomes simpler
for everyone who didn't write it.


@nsajko nsajko commented May 3, 2018

Please continue with codereview here: nsajko/go#1

Also, all issues on that repo will be relevant: https://github.com/nsajko/go/issues


@rsc rsc commented May 7, 2018

Putting on hold until the code has been reviewed and cleaned up. Part of the decision here will be informed by how much complexity we are taking on, so it makes sense to wait to evaluate that until the code is as clean as it can be made.
