Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sort: use pdqsort #50154

Closed
zhangyunhao116 opened this issue Dec 14, 2021 · 12 comments
Closed

sort: use pdqsort #50154

zhangyunhao116 opened this issue Dec 14, 2021 · 12 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@zhangyunhao116
Copy link
Contributor

zhangyunhao116 commented Dec 14, 2021

Abstract

From ByteDance Programming Language Team

We suggest using pdqsort in the sort package. The new algorithm is mainly based on pattern-defeating quicksort by Orson Peters. Both Rust and C++ Boost have used pdqsort in their library.

  • Across all benchmarks, pdqsort is never significantly slower than the previous algorithm(in the built-in sort package).
  • In common patterns, pdqsort is often faster (i.e. 10x faster in sorted slices).

Implementations

  • Gerrit CL: https://go-review.googlesource.com/c/go/+/371574
  • stdpdqsort: based on sort.Interface. Same as the Gerrit CL, but contains more tests and benchmarks.
  • pdqsort: based on generics(need Go 1.18), about 2x ~ 60x faster than built-in package(only support Ordered types). Since Go1.18 is not released, this implementation is just a demo for now.

Algorithm details

The new algorithm is mainly based on pdqsort, but with these modifications:

  • Disable the optimization from BlockQuicksort, since its poor performance in Go.
  • Modify some parameters according to the benchmark results.

Known issues

  • The new algorithm is still an unstable sorting algorithm, it may rearrange equal elements differently compared to the original sorting algorithm in the sort package.

Benchmarks

Go version: go1.18-d6c4583ad4 linux/amd64

CPU: Intel 11700k(8C16T)

OS: ubuntu 20.04

MEMORY: 16G x 2 (3200MHz)

  • Benchmark from sort package. (go test sort -count=20 -run=NOTEST -bench=.)
name                   old time/op  new time/op  delta
SearchWrappers-16      72.8ns ± 3%  75.1ns ± 2%   +3.25%  (p=0.000 n=20+10)
SortString1K-16        85.2µs ± 3%  86.2µs ± 5%     ~     (p=0.247 n=19+10)
SortString1K_Slice-16  84.6µs ± 4%  86.1µs ± 4%     ~     (p=0.120 n=20+10)
StableString1K-16       112µs ± 5%   112µs ± 5%     ~     (p=0.604 n=19+10)
SortInt1K-16           44.8µs ± 3%  43.2µs ± 2%   -3.68%  (p=0.000 n=20+10)
SortInt1K_Sorted-16    28.2µs ± 3%   3.3µs ± 3%  -88.16%  (p=0.000 n=19+10)
SortInt1K_Reversed-16  29.4µs ± 3%   4.8µs ± 2%  -83.59%  (p=0.000 n=20+10)
SortInt1K_Mod8-16      25.1µs ± 2%  20.0µs ± 2%  -20.35%  (p=0.000 n=18+10)
StableInt1K-16         51.3µs ± 3%  50.9µs ± 2%     ~     (p=0.562 n=20+9)
StableInt1K_Slice-16   49.5µs ± 2%  50.7µs ± 4%   +2.55%  (p=0.009 n=19+10)
SortInt64K-16          4.73ms ± 3%  4.49ms ± 4%   -5.08%  (p=0.000 n=20+10)
SortInt64K_Slice-16    4.51ms ± 3%  4.35ms ± 1%   -3.42%  (p=0.000 n=20+8)
StableInt64K-16        4.85ms ± 2%  4.82ms ± 2%     ~     (p=0.267 n=20+10)
Sort1e2-16             27.9µs ± 1%  28.1µs ± 2%     ~     (p=0.198 n=20+10)
Stable1e2-16           56.6µs ± 2%  55.0µs ± 2%   -2.88%  (p=0.000 n=20+10)
Sort1e4-16             5.51ms ± 1%  5.36ms ± 1%   -2.58%  (p=0.000 n=19+9)
Stable1e4-16           17.8ms ± 1%  17.3ms ± 1%   -2.40%  (p=0.000 n=20+10)
Sort1e6-16              833ms ± 1%   807ms ± 1%   -3.02%  (p=0.000 n=20+10)
Stable1e6-16            3.49s ± 2%   3.44s ± 1%   -1.41%  (p=0.001 n=20+10)
  • Benchmark from stdpdqsort. (go test -run=NOTEST -bench=. -cpu=1 -benchtime=1000x -count=10 -timeout=60m > release.txt && benchstat release.txt)

    • Random: data[i] = rand.Int()

    • Sorted: data[i] = i

    • NearlySorted:

      for i := range x {
      	x[i] = i
      }
      for i := 0; i < len(x)/20; i++ {
      	a, b := rand.Intn(len(x)), rand.Intn(len(x))
      	x[a], x[b] = x[b], x[a]
      }
    • Reversed: data[i] = len(data) - i

    • NearlyReversed:

      for i := range x {
      	x[i] = len(x) - i
      }
      for i := 0; i < len(x)/20; i++ {
      	a, b := rand.Intn(len(x)), rand.Intn(len(x))
      	x[a], x[b] = x[b], x[a]
      }
    • Mod8: data[i] = i % 8

    name                          time/op
    Random/pdqsort_64             2.44µs ± 1%
    Random/stdsort_64             2.38µs ± 0%
    Random/pdqsort_256            13.1µs ± 7%
    Random/stdsort_256            13.8µs ± 0%
    Random/pdqsort_1024           62.3µs ± 1%
    Random/stdsort_1024           62.2µs ± 0%
    Random/pdqsort_4096            289µs ± 0%
    Random/stdsort_4096            290µs ± 0%
    Random/pdqsort_65536          6.08ms ± 0%
    Random/stdsort_65536          6.11ms ± 0%
    Sorted/pdqsort_64              249ns ± 0%
    Sorted/stdsort_64              709ns ± 1%
    Sorted/pdqsort_256             577ns ± 0%
    Sorted/stdsort_256            3.36µs ± 0%
    Sorted/pdqsort_1024           1.97µs ± 0%
    Sorted/stdsort_1024           16.6µs ± 0%
    Sorted/pdqsort_4096           7.31µs ± 0%
    Sorted/stdsort_4096           78.9µs ± 0%
    Sorted/pdqsort_65536           115µs ± 0%
    Sorted/stdsort_65536          1.73ms ± 0%
    NearlySorted/pdqsort_64        906ns ± 2%
    NearlySorted/stdsort_64       1.00µs ± 2%
    NearlySorted/pdqsort_256      4.52µs ± 1%
    NearlySorted/stdsort_256      5.01µs ± 1%
    NearlySorted/pdqsort_1024     21.2µs ± 0%
    NearlySorted/stdsort_1024     25.6µs ± 4%
    NearlySorted/pdqsort_4096      101µs ± 0%
    NearlySorted/stdsort_4096      117µs ± 1%
    NearlySorted/pdqsort_65536    2.14ms ± 0%
    NearlySorted/stdsort_65536    2.46ms ± 0%
    Reversed/pdqsort_64            320ns ± 1%
    Reversed/stdsort_64            845ns ± 0%
    Reversed/pdqsort_256           831ns ± 1%
    Reversed/stdsort_256          3.71µs ± 0%
    Reversed/pdqsort_1024         2.88µs ± 1%
    Reversed/stdsort_1024         17.9µs ± 1%
    Reversed/pdqsort_4096         10.9µs ± 1%
    Reversed/stdsort_4096         83.9µs ± 0%
    Reversed/pdqsort_65536         172µs ± 0%
    Reversed/stdsort_65536        1.80ms ± 0%
    NearlyReversed/pdqsort_64     1.17µs ± 1%
    NearlyReversed/stdsort_64     1.41µs ± 1%
    NearlyReversed/pdqsort_256    5.63µs ± 1%
    NearlyReversed/stdsort_256    7.24µs ± 1%
    NearlyReversed/pdqsort_1024   28.9µs ± 4%
    NearlyReversed/stdsort_1024   38.3µs ± 0%
    NearlyReversed/pdqsort_4096    136µs ± 1%
    NearlyReversed/stdsort_4096    176µs ± 0%
    NearlyReversed/pdqsort_65536  2.73ms ± 1%
    NearlyReversed/stdsort_65536  3.59ms ± 0%
    Mod8/pdqsort_64                878ns ± 1%
    Mod8/stdsort_64                941ns ± 0%
    Mod8/pdqsort_256              3.01µs ± 1%
    Mod8/stdsort_256              3.78µs ± 0%
    Mod8/pdqsort_1024             10.6µs ± 1%
    Mod8/stdsort_1024             14.3µs ± 0%
    Mod8/pdqsort_4096             45.7µs ± 0%
    Mod8/stdsort_4096             59.4µs ± 1%
    Mod8/pdqsort_65536             754µs ± 1%
    Mod8/stdsort_65536            1.01ms ± 0%
    AllEqual/pdqsort_64            250ns ± 0%
    AllEqual/stdsort_64            376ns ± 2%
    AllEqual/pdqsort_256           578ns ± 0%
    AllEqual/stdsort_256          1.02µs ± 1%
    AllEqual/pdqsort_1024         1.97µs ± 0%
    AllEqual/stdsort_1024         3.69µs ± 0%
    AllEqual/pdqsort_4096         7.20µs ± 0%
    AllEqual/stdsort_4096         14.0µs ± 0%
    AllEqual/pdqsort_65536         115µs ± 0%
    AllEqual/stdsort_65536         223µs ± 0%
    
@gopherbot gopherbot added this to the Proposal milestone Dec 14, 2021
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/371574 mentions this issue: sort: use pdqsort

@robpike
Copy link
Contributor

robpike commented Dec 14, 2021

Without running the same set of benchmarks on the same data with the new and old code, it's hard to see the improvement.

@zhangyunhao116
Copy link
Contributor Author

Updated, add Sorted Reversed Mod8 to the built-in benchmark.

@esote
Copy link

esote commented Dec 14, 2021

Reran the stdpdqsort benchmarks to get a delta. EC2 t2.micro (1 vCPU, 1 GiB mem), goos: linux, goarch: amd64

name                       old time/op    new time/op    delta
Random/sort_64               6.57µs ± 2%    6.67µs ± 1%   +1.52%  (p=0.002 n=10+9)
Random/sort_256              30.6µs ± 5%    31.2µs ± 5%     ~     (p=0.052 n=10+10)
Random/sort_1024              142µs ± 1%     144µs ± 1%   +1.73%  (p=0.000 n=10+10)
Random/sort_4096              674µs ± 1%     674µs ± 0%     ~     (p=0.739 n=10+10)
Random/sort_65536            14.2ms ± 0%    14.1ms ± 0%   -0.68%  (p=0.000 n=10+10)
Sorted/sort_64               2.87µs ± 4%    1.31µs ± 8%  -54.49%  (p=0.000 n=10+10)
Sorted/sort_256              11.7µs ± 0%     2.4µs ± 3%  -79.34%  (p=0.000 n=10+10)
Sorted/sort_1024             54.8µs ± 6%     7.0µs ± 2%  -87.31%  (p=0.000 n=10+10)
Sorted/sort_4096              253µs ± 1%      25µs ± 1%  -90.31%  (p=0.000 n=9+10)
Sorted/sort_65536            5.53ms ± 0%    0.35ms ± 1%  -93.61%  (p=0.000 n=9+10)
NearlySorted/sort_64         3.36µs ± 1%    3.13µs ± 5%   -6.69%  (p=0.000 n=8+10)
NearlySorted/sort_256        14.7µs ± 1%    13.2µs ± 1%   -9.63%  (p=0.000 n=9+8)
NearlySorted/sort_1024       69.2µs ± 4%    59.6µs ± 1%  -13.94%  (p=0.000 n=10+8)
NearlySorted/sort_4096        320µs ± 1%     283µs ± 1%  -11.45%  (p=0.000 n=9+10)
NearlySorted/sort_65536      6.79ms ± 0%    6.03ms ± 1%  -11.17%  (p=0.000 n=9+10)
Reversed/sort_64             3.25µs ± 3%    1.53µs ± 8%  -52.90%  (p=0.000 n=8+10)
Reversed/sort_256            12.7µs ± 2%     3.1µs ± 3%  -75.43%  (p=0.000 n=10+10)
Reversed/sort_1024           55.5µs ± 1%     9.8µs ± 3%  -82.40%  (p=0.000 n=9+10)
Reversed/sort_4096            265µs ± 1%      36µs ± 3%  -86.58%  (p=0.000 n=10+8)
Reversed/sort_65536          5.68ms ± 0%    0.51ms ± 0%  -90.98%  (p=0.000 n=9+10)
NearlyReversed/sort_64       4.44µs ± 1%    3.93µs ± 2%  -11.38%  (p=0.000 n=7+9)
NearlyReversed/sort_256      21.1µs ± 1%    16.8µs ± 1%  -20.16%  (p=0.000 n=10+10)
NearlyReversed/sort_1024     96.8µs ± 3%    77.3µs ± 2%  -20.13%  (p=0.000 n=10+9)
NearlyReversed/sort_4096      453µs ± 1%     365µs ± 1%  -19.42%  (p=0.000 n=10+10)
NearlyReversed/sort_65536    9.22ms ± 0%    7.40ms ± 2%  -19.75%  (p=0.000 n=8+10)
Mod8/sort_64                 3.39µs ± 4%    3.32µs ± 7%     ~     (p=0.190 n=9+10)
Mod8/sort_256                12.5µs ± 1%    10.3µs ± 1%  -17.87%  (p=0.000 n=10+10)
Mod8/sort_1024               49.1µs ± 6%    39.6µs ± 6%  -19.43%  (p=0.000 n=9+10)
Mod8/sort_4096                200µs ± 2%     149µs ± 2%  -25.53%  (p=0.000 n=10+10)
Mod8/sort_65536              2.98ms ± 0%    2.20ms ± 1%  -26.34%  (p=0.000 n=10+10)
AllEqual/sort_64             1.68µs ± 6%    1.33µs ± 7%  -20.73%  (p=0.000 n=10+10)
AllEqual/sort_256            3.94µs ± 1%    2.49µs ± 3%  -36.69%  (p=0.000 n=9+10)
AllEqual/sort_1024           12.9µs ± 1%     6.9µs ± 3%  -46.14%  (p=0.000 n=9+10)
AllEqual/sort_4096           46.4µs ± 5%    24.8µs ± 1%  -46.68%  (p=0.000 n=10+8)
AllEqual/sort_65536           697µs ± 0%     352µs ± 1%  -49.50%  (p=0.000 n=10+10)
[Geo mean]                   78.7µs         40.8µs       -48.17%

@ianlancetaylor
Copy link
Contributor

I think the only reason this could be considered as a proposal is that it will change the sorting of equal elements when using sort.Sort. If that is OK then this is just an implementation change. (Personally, I think that the change in sorting is OK.)

@esote
Copy link

esote commented Dec 15, 2021

Since sort.Sort is not guaranteed to be stable I think it's safe to change. If particular sorting for equal elements was needed, sort.Stable or a total ordering function for Less should be used

@rsc
Copy link
Contributor

rsc commented Dec 15, 2021

We've changed the sort results in the past. There was a release that greatly reduced the number of comparisons but changed sort orders. So changing again if we have significant optimizations to apply should be fine (modulo reviews, code being fine, etc).

@rsc rsc changed the title proposal: sort: use pdqsort sort: use pdqsort Dec 15, 2021
@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Proposal labels Dec 15, 2021
@ianlancetaylor ianlancetaylor modified the milestones: Proposal, Backlog Dec 15, 2021
@thepudds
Copy link
Contributor

thepudds commented Mar 29, 2022

Hi @zhangyunhao116

Disable the optimization from BlockQuicksort, since its poor performance in Go.

Would you mind saying a few words about what you tried for this, and any theories as to why it had poor performance in Go?

For example, was the Go compiler not generating similar cmov or setcc instructions as a C++ or Rust compiler, or maybe it was something else?

(My rough understanding is that BlockQuicksort was a very useful optimization in the original pdqsort. Maybe someone will have an idea about how to write it in a way that is friendly to the current Go compiler, or maybe there is a future compiler optimization that might make it more workable for Go, or ...).

Thank you for your work on this -- it looks to be a very nice improvement!

@zhangyunhao116
Copy link
Contributor Author

For example, was the Go compiler not generating similar cmov or setcc instructions as a C++ or Rust compiler, or maybe it was something else?

Yes, the Go compiler looks like not generating CMOV as the Rust compiler does. It would be nice if you could find a better way to write this code :)

	for i := 0; i < BLOCK; i++ {
		offsetsL[numL] = L + i
		if pivot <= data[L+i] {
			numL++ // Line 61
		}
	}

This code will generate something like

        main.go:61      0x45ca15        488b542450                      mov rdx, qword ptr [rsp+0x50]
        main.go:61      0x45ca1a        48ffc2                          inc rdx
        main.go:61      0x45ca1d        4889542450                      mov qword ptr [rsp+0x50], rdx

Other reasons maybe cache miss, or the BlockQuickSort needs more stack for each call(BLOCK*2). The benchmark is based on a global pre-allocated block, but the performance is still not good.

A more important reason is that sort.Interface doesn't support storing its data into another block, we can only use the BlockQuickSort in the generic version of pdqsort. Currently, pdqsort(generic) and pdqsort(sort.Interface) are generated from the same code in the src/sort/gen_sort_variants.go. So the BlockQuickSort is dropped in the current implementation.

cc @thepudds

gopherbot pushed a commit that referenced this issue Apr 13, 2022
- Across all benchmarks, pdqsort is never significantly slower than the previous algorithm.
- In common patterns, pdqsort is often faster (i.e. 10x faster in sorted slices).

The pdqsort is described at https://arxiv.org/pdf/2106.05123.pdf

This CL is inspired by both C++ implementation and Rust implementation.
- C++ implementation: https://github.com/orlp/pdqsort
- Rust implementation: https://docs.rs/pdqsort/latest/pdqsort/

For #50154

name                   old time/op  new time/op  delta
SearchWrappers-16      72.8ns ± 3%  75.1ns ± 2%   +3.25%  (p=0.000 n=20+10)
SortString1K-16        85.2µs ± 3%  86.2µs ± 5%     ~     (p=0.247 n=19+10)
SortString1K_Slice-16  84.6µs ± 4%  86.1µs ± 4%     ~     (p=0.120 n=20+10)
StableString1K-16       112µs ± 5%   112µs ± 5%     ~     (p=0.604 n=19+10)
SortInt1K-16           44.8µs ± 3%  43.2µs ± 2%   -3.68%  (p=0.000 n=20+10)
SortInt1K_Sorted-16    28.2µs ± 3%   3.3µs ± 3%  -88.16%  (p=0.000 n=19+10)
SortInt1K_Reversed-16  29.4µs ± 3%   4.8µs ± 2%  -83.59%  (p=0.000 n=20+10)
SortInt1K_Mod8-16      25.1µs ± 2%  20.0µs ± 2%  -20.35%  (p=0.000 n=18+10)
StableInt1K-16         51.3µs ± 3%  50.9µs ± 2%     ~     (p=0.562 n=20+9)
StableInt1K_Slice-16   49.5µs ± 2%  50.7µs ± 4%   +2.55%  (p=0.009 n=19+10)
SortInt64K-16          4.73ms ± 3%  4.49ms ± 4%   -5.08%  (p=0.000 n=20+10)
SortInt64K_Slice-16    4.51ms ± 3%  4.35ms ± 1%   -3.42%  (p=0.000 n=20+8)
StableInt64K-16        4.85ms ± 2%  4.82ms ± 2%     ~     (p=0.267 n=20+10)
Sort1e2-16             27.9µs ± 1%  28.1µs ± 2%     ~     (p=0.198 n=20+10)
Stable1e2-16           56.6µs ± 2%  55.0µs ± 2%   -2.88%  (p=0.000 n=20+10)
Sort1e4-16             5.51ms ± 1%  5.36ms ± 1%   -2.58%  (p=0.000 n=19+9)
Stable1e4-16           17.8ms ± 1%  17.3ms ± 1%   -2.40%  (p=0.000 n=20+10)
Sort1e6-16              833ms ± 1%   807ms ± 1%   -3.02%  (p=0.000 n=20+10)
Stable1e6-16            3.49s ± 2%   3.44s ± 1%   -1.41%  (p=0.001 n=20+10)

Change-Id: Iecded047d237b9330b5a4101001a5fdc2f50646a
Reviewed-on: https://go-review.googlesource.com/c/go/+/371574
Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Eli Bendersky <eliben@google.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/399315 mentions this issue: slices: use pdqsort

gopherbot pushed a commit to golang/exp that referenced this issue Apr 26, 2022
Sync with CL 371574.

- pdqsort paper: https://arxiv.org/pdf/2106.05123.pdf
- C++ implementation: https://github.com/orlp/pdqsort
- Rust implementation: https://docs.rs/pdqsort/latest/pdqsort/

For golang/go#50154

name                       old time/op  new time/op  delta
SortInts-16                10.8ms ± 3%  10.8ms ± 3%     ~     (p=0.461 n=20+20)
SlicesSortInts-16          6.14ms ± 3%  6.29ms ± 3%   +2.43%  (p=0.000 n=20+19)
SlicesSortInts_Sorted-16   1.78ms ± 4%  0.09ms ± 2%  -95.01%  (p=0.000 n=18+20)
SlicesSortInts_Reversed-16 1.82ms ± 5%  0.13ms ± 2%  -92.67%  (p=0.000 n=20+20)
SortStrings-16             23.0ms ± 2%  23.1ms ± 1%     ~     (p=0.253 n=20+20)
SlicesSortStrings-16       19.1ms ± 2%  19.0ms ± 2%     ~     (p=0.270 n=20+19)
SortStructs-16             15.7ms ± 4%  15.7ms ± 3%     ~     (p=0.989 n=20+20)
SortFuncStructs-16         13.6ms ± 1%  12.7ms ± 3%   -6.39%  (p=0.000 n=19+20)

Change-Id: Ia563fe1c70d6d2ac098a97e1f870040d038cf61c
Reviewed-on: https://go-review.googlesource.com/c/exp/+/399315
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Eli Bendersky <eliben@google.com>
@zhangyunhao116
Copy link
Contributor Author

All changes have been submitted, so closing this issue. Thanks all!

@MaoJianwei
Copy link
Contributor

Good job! Long time to be merged. Cheers~
@zhangyunhao116

@golang golang locked and limited conversation to collaborators Jun 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

8 participants