
proposal: drastically improve performance of sort.Sort with big slices #14653

Closed
rressi opened this Issue Mar 5, 2016 · 8 comments


rressi commented Mar 5, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    go1.6 darwin/amd64
  2. What operating system and processor architecture are you using (go env)?
    OSX, x86_64
  3. What did you do?
    I'm trying to sort a text file with millions of lines of code.
package main

import (
    "flag"
    "io/ioutil"
    "log"
    "sort"
    "strings"
)

func main() {

    flag.Parse()
    if flag.NArg() != 2 {
        log.Fatal("Syntax error: sort <srcPath> <dstPath>")
    }

    srcPath := flag.Arg(0)
    dstPath := flag.Arg(1)
    log.Printf("Sorting [%v] -> [%v]", srcPath, dstPath)

    srcBytes, err := ioutil.ReadFile(srcPath)
    if err != nil {
        log.Fatalf("%s", err)
    }
    log.Printf("%v bytes loaded", len(srcBytes))

    lines := strings.Split(string(srcBytes), "\n")
    log.Printf("%v lines found", len(lines))

    sort.Strings(lines)
    log.Printf("%v lines sorted", len(lines))

    dstBytes := []byte(strings.Join(lines, "\n"))
    log.Printf("%v bytes joined", len(dstBytes))

    err = ioutil.WriteFile(dstPath, dstBytes, 0666)
    if err != nil {
        log.Fatalf("%s", err)
    }
    log.Printf("%v bytes saved", len(dstBytes))
}
  4. What did you expect to see?
    My file sorted, but faster.
  5. What did you see instead?
    My file sorted, but slower.

Story

The current implementation of sort.Sort is based on the quicksort algorithm. It is a very good choice, but like many others it has a time complexity of O(N log N).

This means that if we can first split the task into k smaller tasks of size n with a single O(N) pass, we can improve performance:

N + k * (n * log(n)) < N * log(N), where k * n = N
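As a back-of-the-envelope check (my own illustration, not from the issue), here is the inequality evaluated for the experiment below, assuming N = 14 million lines and k = 65536 perfectly uniform two-byte buckets:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Illustrative only: real buckets are far from uniform.
	N := 14e6                          // lines to sort
	k := 65536.0                       // two-byte buckets
	n := N / k                         // ~214 lines per bucket
	naive := N * math.Log2(N)          // ~3.3e8 comparisons
	bucketed := N + k*(n*math.Log2(n)) // one O(N) pass + k small sorts, ~1.2e8
	fmt.Printf("naive: %.2e  bucketed: %.2e\n", naive, bucketed)
}
```

That is roughly a 2.7x reduction in comparison work; the measured 17 s -> 10 s gain reported below is smaller because bucketing and I/O are not free.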

My experiments bear this out:

  • I joined all text lines of the Linux kernel's sources into a single file (stripped, keeping only lines with len >= 2). The outcome was a file of 14 million lines.
  • I sorted the file with two different strategies:
    • Naive approach: load everything into one huge slice, apply sort.Sort to it, save it back to the file.
    • Faster approach: while loading, split the lines into buckets keyed by their first two bytes (key := int(line[0]) * 256 + int(line[1])); then sort each bucket; finally save all the buckets' lines, which at that point are already in global order.
    // Allocates all possible buckets:
    const KEY_SIZE = 2
    buckets = make([][]string, 1<<(KEY_SIZE*8))

    // Loads text lines from the source file and distributes them into
    // their buckets, keyed by the leading KEY_SIZE bytes:
    scan := bufio.NewScanner(fd)
    for scan.Scan() {
        line := scan.Text()

        key := 0
        multiplier := len(buckets)
        for i := 0; i < KEY_SIZE && i < len(line); i++ {
            multiplier >>= 8
            key += multiplier * int(line[i])
        }

        buckets[key] = append(buckets[key], line)
        numLines++
    }
    ...
    // Sorts each bucket and compacts the non-empty ones to the front;
    // the buckets are already in key order, so concatenating them
    // yields the globally sorted result:
    var j int
    for i, bucket := range buckets {
        if len(bucket) == 0 {
            continue
        }

        sort.Strings(bucket)

        if i != j {
            buckets[j] = bucket
            buckets[i] = nil
        }
        j++
    }

    sortedBuckets = buckets[:j]
The second algorithm was much faster: from 17 seconds to 10 seconds.

Moreover, additional seconds can easily be gained by using goroutines, since the buckets are independent of one another; in my experiments I got down to 7 seconds on a 4-core machine.
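The issue does not include the parallel variant, but it can be a minimal sketch like the following, assuming the buckets slice from the snippet above; a bounded worker pool keeps CPU usage under control:

```go
package main

import (
	"runtime"
	"sort"
	"sync"
)

// sortBuckets sorts every non-empty bucket concurrently, using at most
// one worker per CPU. Buckets are independent, so no synchronization is
// needed beyond the jobs channel and the WaitGroup.
func sortBuckets(buckets [][]string) {
	jobs := make(chan []string)
	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range jobs {
				sort.Strings(b)
			}
		}()
	}
	for _, b := range buckets {
		if len(b) > 1 { // buckets of 0 or 1 elements are already sorted
			jobs <- b
		}
	}
	close(jobs)
	wg.Wait()
}
```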

Proposal

My idea is to provide a set of functions like the following (names and documentation would need to be improved):

sort.RadixSortStrings(src []string) <-chan string
sort.RadixSortInts(src []int) <-chan int
...

type InterfaceRadix interface {

        Interface

        // Radix returns a bucket key for the element at position i,
        // suitable for the given number of buckets.
        // Keys must sort in the same natural order as the elements themselves.
        // Keys must satisfy 0 <= key < numBuckets.
        // Different elements may produce the same key, but a roughly uniform
        // distribution of keys is desirable.
        Radix(i, numBuckets int) int
}
sort.RadixSort(src InterfaceRadix) <-chan InterfaceRadix
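For illustration only (this adapter is mine, not part of the proposal), a string slice could satisfy InterfaceRadix with the same two-byte key used in the experiment above:

```go
package main

import "sort"

// radixStrings is a hypothetical adapter showing how a []string could
// satisfy the proposed InterfaceRadix.
type radixStrings []string

func (r radixStrings) Len() int           { return len(r) }
func (r radixStrings) Less(i, j int) bool { return r[i] < r[j] }
func (r radixStrings) Swap(i, j int)      { r[i], r[j] = r[j], r[i] }

// Radix maps element i into [0, numBuckets) using the string's leading
// bytes, so that bucket order matches lexicographic order.
func (r radixStrings) Radix(i, numBuckets int) int {
	key := 0
	multiplier := numBuckets
	for b := 0; b < len(r[i]) && multiplier >= 256; b++ {
		multiplier >>= 8
		key += multiplier * int(r[i][b])
	}
	return key
}

var _ sort.Interface = radixStrings(nil)
```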

These functions:

  • would sort out of place (a common requirement);
  • would make explicit use of goroutines, because sorting the individual buckets is an embarrassingly parallel problem;
  • would return elements through a channel, so the caller can start processing the first elements while later buckets are still being sorted;
  • would come with InterfaceRadix already implemented for the basic types.

Having this set of functions would give Go a competitive advantage over other languages in modern applications, where sorting millions of elements is quite usual.

These functions may already prove competitive with just tens or hundreds of thousands of elements.
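A minimal sketch of how the string flavor could behave, assuming 256 single-byte buckets and streaming each bucket as soon as it is sorted (my illustration of the proposed semantics, not a finished implementation):

```go
package main

import "sort"

// RadixSortStrings distributes the input into 256 buckets by first
// byte, sorts each bucket in its own goroutine, and streams the buckets
// in key order as soon as each is ready, so the consumer can process
// the smallest elements while later buckets are still being sorted.
func RadixSortStrings(src []string) <-chan string {
	const numBuckets = 256
	buckets := make([][]string, numBuckets)
	for _, s := range src {
		key := 0
		if len(s) > 0 {
			key = int(s[0])
		}
		buckets[key] = append(buckets[key], s)
	}

	done := make([]chan struct{}, numBuckets)
	for i := range done {
		done[i] = make(chan struct{})
		go func(i int) {
			sort.Strings(buckets[i]) // buckets are independent
			close(done[i])
		}(i)
	}

	out := make(chan string)
	go func() {
		defer close(out)
		for i, b := range buckets {
			<-done[i] // wait only for this bucket
			for _, s := range b {
				out <- s
			}
		}
	}()
	return out
}
```

The consumer simply ranges over the channel, e.g. `for line := range RadixSortStrings(lines) { ... }`, receiving the smallest lines while the larger buckets are still sorting.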

rressi commented Mar 5, 2016

I offer to provide an implementation if the proposal is accepted, but I would prefer to wait and see how the community improves on my original idea.

ianlancetaylor commented Mar 5, 2016

I think the first step would be to write this code in a go-gettable package. I don't see an obvious reason that it has to start out in the standard library. Starting outside will provide more flexibility for API and implementation choices. An external package can still use sort.Interface.

rressi commented Mar 5, 2016

That's a good point, and I'd probably proceed that way.

The only thing I don't want to miss is feedback on these ideas from more people, in order to create something that fits the Go development community better.

Reasoning a bit longer about an interface and its implementation often brings better results, and this is probably the best place to involve Go developers and their ideas.

minux commented Mar 5, 2016

rressi commented Mar 6, 2016

@minux @ianlancetaylor You both have good points.

I just published a first draft of my idea here:

I will start using golang-nuts to involve more people as soon as I have solid benchmarks to present and a script that lets everybody test my findings on their own.

I think we can close this issue for now, hoping to come back later with something more valuable for this project.

OneOfOne commented Mar 6, 2016

I went with a different approach: I copied the stdlib's quickSort and made a naive parallel version of it.

It shows a 15%+ improvement for slices of >= 10k elements, it uses sort.Interface, and it's a drop-in replacement; in my tests I just had sort.Sort pick psort when a.Len() >= 5000.

The code is here.
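The linked code is not reproduced in the issue; as a rough illustration of the approach (a naive parallel quicksort over sort.Interface, not OneOfOne's actual implementation):

```go
package main

import (
	"sort"
	"sync"
)

const parallelCutoff = 5000 // below this, fall back to the stdlib sort

// subRange exposes data[lo..hi] as its own sort.Interface so the
// sequential fallback can sort just that window.
type subRange struct {
	data   sort.Interface
	lo, hi int
}

func (s subRange) Len() int           { return s.hi - s.lo + 1 }
func (s subRange) Less(i, j int) bool { return s.data.Less(s.lo+i, s.lo+j) }
func (s subRange) Swap(i, j int)      { s.data.Swap(s.lo+i, s.lo+j) }

// partition is a Lomuto partition of data[lo..hi] around the last
// element; it returns the pivot's final index.
func partition(data sort.Interface, lo, hi int) int {
	i := lo
	for j := lo; j < hi; j++ {
		if data.Less(j, hi) {
			data.Swap(i, j)
			i++
		}
	}
	data.Swap(i, hi)
	return i
}

// pquicksort sorts data[lo..hi], handing each half of a large range to
// its own goroutine and sorting small ranges sequentially.
func pquicksort(data sort.Interface, lo, hi int) {
	if hi <= lo {
		return
	}
	if hi-lo+1 < parallelCutoff {
		sort.Sort(subRange{data, lo, hi})
		return
	}
	p := partition(data, lo, hi)
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); pquicksort(data, lo, p-1) }()
	go func() { defer wg.Done(); pquicksort(data, p+1, hi) }()
	wg.Wait()
}

// PSort is the drop-in entry point described above.
func PSort(data sort.Interface) { pquicksort(data, 0, data.Len()-1) }
```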

Update: due to #14677, the updated benchmarks are:

name              old time/op    new time/op    delta
NewIndexRandom-8     202ms ± 1%     177ms ± 2%   -12.29%  (p=0.008 n=5+5)
NewIndexRepeat-8     339ms ± 2%     279ms ± 2%   -17.82%  (p=0.008 n=5+5)

SortString1K-8     152µs ± 1%     157µs ± 0%    +3.51%          (p=0.008 n=5+5)
SortInt1K-8       70.3µs ± 1%    75.2µs ± 1%    +6.94%          (p=0.008 n=5+5)
SortInt64K-8      6.44ms ± 1%    2.60ms ± 1%   -59.70%          (p=0.008 n=5+5)
Sort1e2-8         40.7µs ± 1%    41.5µs ± 3%      ~             (p=0.056 n=5+5)
Sort1e4-8         8.62ms ± 3%    5.31ms ± 0%   -38.33%          (p=0.008 n=5+5)
Sort1e6-8          1.30s ± 2%     0.69s ± 2%   -46.52%          (p=0.008 n=5+5)

old

# go test -v -bench=NewIndex -benchmem -count 5
➜ benchstat {old,new}.txt
name              old time/op    new time/op    delta
NewIndexRandom-8     203ms ± 0%     208ms ± 2%    +2.49%  (p=0.008 n=5+5)
NewIndexRepeat-8     339ms ± 2%     402ms ± 2%   +18.86%  (p=0.008 n=5+5)

name              old alloc/op   new alloc/op   delta
NewIndexRandom-8    16.1MB ± 0%    16.1MB ± 0%    +0.08%  (p=0.008 n=5+5)
NewIndexRepeat-8    39.0MB ± 0%    39.0MB ± 0%    +0.00%  (p=0.008 n=5+5)

name              old allocs/op  new allocs/op  delta
NewIndexRandom-8      33.6 ±10%     299.2 ± 1%  +790.48%  (p=0.008 n=5+5)
NewIndexRepeat-8      73.8 ±11%     103.2 ± 2%   +39.84%  (p=0.008 n=5+5)

# go test -bench=kSort -count 5 -benchmem
➜ benchstat {old,new}.txt
name            old time/op    new time/op    delta
SortString1K-8     157µs ± 1%     198µs ± 2%   +26.64%          (p=0.008 n=5+5)
SortInt1K-8       73.2µs ± 1%   115.2µs ± 0%   +57.30%          (p=0.008 n=5+5)
SortInt64K-8      6.84ms ± 2%    4.11ms ± 3%   -39.93%          (p=0.008 n=5+5)
Sort1e2-8         42.8µs ± 1%    42.5µs ± 1%      ~             (p=0.421 n=5+5)  # this falls back to the standard sort.
Sort1e4-8         8.60ms ± 1%    7.23ms ± 1%   -15.99%          (p=0.008 n=5+5)
Sort1e6-8          1.35s ± 7%     0.98s ± 7%   -27.46%          (p=0.008 n=5+5)

name            old alloc/op   new alloc/op   delta
SortString1K-8     32.0B ± 0%     80.4B ± 1%  +151.25%          (p=0.008 n=5+5)
SortInt1K-8        32.0B ± 0%     80.0B ± 0%  +150.00%          (p=0.008 n=5+5)
SortInt64K-8       32.0B ± 0%     80.0B ± 0%  +150.00%          (p=0.008 n=5+5)
Sort1e2-8           224B ± 0%      224B ± 0%      ~     (all samples are equal)
Sort1e4-8           224B ± 0%      560B ± 0%  +150.00%          (p=0.008 n=5+5)
Sort1e6-8           224B ± 0%      560B ± 0%  +150.00%          (p=0.008 n=5+5)

name            old allocs/op  new allocs/op  delta
SortString1K-8      1.00 ± 0%      2.00 ± 0%  +100.00%          (p=0.008 n=5+5)
SortInt1K-8         1.00 ± 0%      2.00 ± 0%  +100.00%          (p=0.008 n=5+5)
SortInt64K-8        1.00 ± 0%      2.00 ± 0%  +100.00%          (p=0.008 n=5+5)
Sort1e2-8           7.00 ± 0%      7.00 ± 0%      ~     (all samples are equal)
Sort1e4-8           7.00 ± 0%     14.00 ± 0%  +100.00%          (p=0.008 n=5+5)
Sort1e6-8           7.00 ± 0%     14.00 ± 0%  +100.00%          (p=0.008 n=5+5)
rressi commented Mar 7, 2016

Great job!
How many CPU cores were involved?

The problem with using it in production code is that CPUs are often a resource shared with concurrent tasks, so burning more CPUs just to gain 15% may easily make the overall system slower through less efficient use of resources. That is why I think a robust single-threaded base implementation must remain the default flavor in the stdlib.

I'm trying to make a solution for a slightly different problem than sort.Sort:

  • Big numbers: we have more objects than usual to sort, so it is worth using more CPUs to cut the sorting time, ideally by an order of magnitude.
  • Being functional: we want an out-of-place sort; we don't want to move the objects, but instead to get back a list of indices that lets us later iterate the original series in sorted order without touching the series itself. Suppose you need to iterate the same series in different orders and don't want to make extra copies of it (see the sketch after this list).
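The second point, sorting indices instead of the data, can look like this minimal sketch (my illustration, not code from the issue):

```go
package main

import (
	"fmt"
	"sort"
)

// byIndex sorts a permutation of indices by the values they point to,
// leaving the underlying data untouched.
type byIndex struct {
	data []string
	idx  []int
}

func (b byIndex) Len() int           { return len(b.idx) }
func (b byIndex) Less(i, j int) bool { return b.data[b.idx[i]] < b.data[b.idx[j]] }
func (b byIndex) Swap(i, j int)      { b.idx[i], b.idx[j] = b.idx[j], b.idx[i] }

func main() {
	data := []string{"pear", "apple", "plum"}
	idx := make([]int, len(data))
	for i := range idx {
		idx[i] = i
	}
	sort.Sort(byIndex{data, idx})
	for _, i := range idx {
		fmt.Println(data[i]) // apple, pear, plum — data itself never moved
	}
}
```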

When we need to scale up, even N log N versus N can make the difference. The use of more CPUs should be either optional or close to the ideal scenario, where speed is multiplied by the number of cores.

But let me share the numbers I get on my machine with the latest implementation:

$ GOMAXPROCS=1 go test -bench=. big
PASS
BenchmarkRadixSort_1k       5000        368449 ns/op
BenchmarkRadixSort_10k       500       3313324 ns/op
BenchmarkRadixSort_100k       30      42800573 ns/op
BenchmarkRadixSort_1M          2     525938359 ns/op
BenchmarkRadixSort_10M         1    7420893394 ns/op
BenchmarkSort_1k            5000        248871 ns/op
BenchmarkSort_10k            500       3232176 ns/op
BenchmarkSort_100k            30      39452604 ns/op
BenchmarkSort_1M               2     574389730 ns/op
BenchmarkSort_10M              1    7698980123 ns/op
ok      big 60.586s

With only one CPU we get similar speeds; not good. It gets slightly better only for big inputs, but not by enough: the function that generates the radix needs improvement, because we are still closer to N log N than to N.

For small inputs there is no point; sort.Sort is better.

Let's scale up:

$ GOMAXPROCS=2 go test -bench=. big
PASS
BenchmarkRadixSort_1k-2         5000        294862 ns/op
BenchmarkRadixSort_10k-2        1000       2186454 ns/op
BenchmarkRadixSort_100k-2         50      25943125 ns/op
BenchmarkRadixSort_1M-2            5     306473763 ns/op
BenchmarkRadixSort_10M-2           1    4313142617 ns/op
BenchmarkSort_1k-2              5000        247954 ns/op
BenchmarkSort_10k-2              500       3218643 ns/op
BenchmarkSort_100k-2              30      39316613 ns/op
BenchmarkSort_1M-2                 2     578850846 ns/op
BenchmarkSort_10M-2                1    7710922509 ns/op
ok      big 55.384s
N      difference
1k     -18%
10k    +32%
100k   +34%
1M     +47%
10M    +44%

It may or may not be worth the cost of using 2 CPUs.

Let's scale up again:

$ GOMAXPROCS=4 go test -bench=. big
PASS
BenchmarkRadixSort_1k-4         5000        251724 ns/op
BenchmarkRadixSort_10k-4        1000       1712307 ns/op
BenchmarkRadixSort_100k-4         50      22316161 ns/op
BenchmarkRadixSort_1M-4            5     208881485 ns/op
BenchmarkRadixSort_10M-4           1    2851469957 ns/op
BenchmarkSort_1k-4              5000        247612 ns/op
BenchmarkSort_10k-4              500       3225756 ns/op
BenchmarkSort_100k-4              30      39707499 ns/op
BenchmarkSort_1M-4                 2     579981271 ns/op
BenchmarkSort_10M-4                1    7748238902 ns/op
ok      big 48.582s
N      difference
1k     -2%
10k    +46%
100k   +43%
1M     +63%
10M    +63%

It is clear that the time complexity of the two algorithms is still effectively the same (my radix-extraction function is not efficient), but the new one is able to use more than one core, albeit not with great efficiency.

Here comes the bad news; I added memory profiling to the game:

$ go test -bench=. -benchmem big
PASS
BenchmarkRadixSort_1k-4         5000        254219 ns/op       59906 B/op        723 allocs/op
BenchmarkRadixSort_10k-4        1000       1702797 ns/op      422103 B/op       2685 allocs/op
BenchmarkRadixSort_100k-4         50      22269063 ns/op     4228728 B/op      10617 allocs/op
BenchmarkRadixSort_1M-4            5     209942919 ns/op    43352969 B/op      61476 allocs/op
BenchmarkRadixSort_10M-4           1    2809521198 ns/op    468514128 B/op    222642 allocs/op
BenchmarkSort_1k-4              5000        247852 ns/op          32 B/op          1 allocs/op
BenchmarkSort_10k-4              500       3219610 ns/op          32 B/op          1 allocs/op
BenchmarkSort_100k-4              30      39398951 ns/op          32 B/op          1 allocs/op
BenchmarkSort_1M-4                 2     577199850 ns/op          32 B/op          1 allocs/op
BenchmarkSort_10M-4                1    7674743659 ns/op          32 B/op          1 allocs/op

Conclusion: the new algorithm must be improved before making a proposal.

adg commented Mar 9, 2016

Hi, thanks for your proposal.

As was mentioned upthread, this is the kind of project that should be thoroughly developed outside the Go core before we can think about integrating it. That being the case, I'm going to close this issue.

Thanks again.

adg closed this Mar 9, 2016

golang locked and limited conversation to collaborators Mar 13, 2017
