
There's a performance improved fork of this project #69

Closed
rogers0 opened this issue Sep 24, 2017 · 8 comments

Comments

@rogers0

rogers0 commented Sep 24, 2017

I'm the Debian package maintainer of your reedsolomon project.
It has come to my attention that there's a fork of your project that claims a big performance boost:

I asked the fork's author to send the improvements upstream as a patch, but he/she doesn't seem interested.

I looked at both projects and found the differences too extensive for me to turn into patches myself.
So maybe you're interested in those improvements and could take a look at the project? Thanks!

@klauspost
Owner

Honestly, I don't think the difference is that big. You can tweak the benchmarks in this package to operate on data in caches.

Running on my local desktop (Core i7-2600 @ 3.4 GHz):

klauspost:
BenchmarkEncode10x4x16M-4                     30          41895236 ns/op        4004.56 MB/s
BenchmarkEncode10x4x16M-2                     30          45082120 ns/op        3721.48 MB/s
BenchmarkEncode10x4x16M                       20          72077065 ns/op        2327.68 MB/s

templexxx:
BenchmarkEnc/#01/10+4_16777216-8              20          63102500 ns/op        2658.72 MB/s

So single-core performance is a bit better, but nothing mind-blowing.

I looked at the code, and it is pretty much the same. It has a bit more unrolling, but the code is memory limited anyway. I have privately done an unrolled version, but it didn't make any practical difference, so I just dropped it to keep complexity down.
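For reference, the inner loop both implementations share is essentially a table-driven GF(2^8) multiply-accumulate over every byte of a shard, which is why it ends up memory-bound rather than compute-bound. A rough pure-Go sketch of that idea (function names are illustrative, not the package's actual internals; the real assembly uses SSSE3/AVX2 nibble-table lookups, but it still touches every byte of both shards):

```go
package main

import "fmt"

// gfMul multiplies two elements of GF(2^8) using the polynomial 0x11D
// (x^8 + x^4 + x^3 + x^2 + 1) commonly used for Reed-Solomon coding.
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 != 0 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1D // low byte of 0x11D after the shifted-out bit
		}
		b >>= 1
	}
	return p
}

// galMulAdd is the memory-bound core operation: for one matrix
// coefficient c, fold an input shard into an output shard. Unrolling
// this loop helps little because DRAM bandwidth is the ceiling.
func galMulAdd(c byte, in, out []byte) {
	var mulTable [256]byte
	for i := range mulTable {
		mulTable[i] = gfMul(c, byte(i))
	}
	for i, v := range in {
		out[i] ^= mulTable[v]
	}
}

func main() {
	in := []byte{1, 2, 3, 4}
	out := make([]byte, 4)
	galMulAdd(2, in, out)
	fmt.Println(out) // [2 4 6 8]
}
```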

Adding Cauchy matrix would be neat.

@klauspost
Owner

Added #70 with Cauchy matrix.
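For readers unfamiliar with the construction: a Cauchy matrix has entries 1/(x_i + y_j) for distinct x's and y's, and every square submatrix of it is invertible, which is exactly the property erasure coding needs for reconstruction. A hedged GF(2^8) sketch of the textbook construction (not necessarily the exact matrix #70 builds; `cauchyMatrix` and the index choice are illustrative):

```go
package main

import "fmt"

// gfMul multiplies in GF(2^8) with the polynomial 0x11D.
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 != 0 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1D
		}
		b >>= 1
	}
	return p
}

// gfInv finds the multiplicative inverse by brute force; fine for
// building a one-off 255-entry table.
func gfInv(a byte) byte {
	for b := 1; b < 256; b++ {
		if gfMul(a, byte(b)) == 1 {
			return byte(b)
		}
	}
	return 0 // 0 has no inverse
}

// cauchyMatrix returns a rows x cols matrix with m[r][c] = 1/(x_r + y_c),
// choosing x_r = r and y_c = rows + c. Addition in GF(2^8) is XOR, and
// the two index sets are disjoint, so x_r ^ y_c is never zero; requires
// rows + cols <= 256.
func cauchyMatrix(rows, cols int) [][]byte {
	m := make([][]byte, rows)
	for r := range m {
		m[r] = make([]byte, cols)
		for c := range m[r] {
			m[r][c] = gfInv(byte(r) ^ byte(rows+c))
		}
	}
	return m
}

func main() {
	m := cauchyMatrix(4, 10) // 4 parity rows over 10 data shards
	fmt.Println(len(m), len(m[0])) // 4 10
}
```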

@templexxx

@klauspost

Thank you for your contribution, you really helped me a lot 👍

Yes, the code is memory limited, and loop unrolling can't help much (though as a newcomer, it was good practice for me :D).

I added some cache-friendliness tricks, so performance improves when the shard size is big, when the size isn't divisible by 16, or when the size is very small (< 4KB).

For example: I split big shards into pieces of about 16KB to fit the L1 data cache.

So I think BenchmarkEnc/#1/10+4_16777216-8 should be much faster than 2658.72 MB/s; maybe something went wrong there? It was much slower than I expected.

I hope I'm not bothering you.

Thanks

@klauspost
Owner

It is probably because I am on a system without AVX2. Here is a benchmark with AVX2:

I did some tests, and it seems like the "maximum goroutines" setting could benefit from some adjustment:

templexxx:
BenchmarkEnc/#01/10+4_16777216                50          30111190 ns/op        5571.75 MB/s

WithMaxGoroutines(512)
$ go test -short -bench=kEncode10x4x16M -cpu=8,4,2,1
pkg: github.com/klauspost/reedsolomon
BenchmarkEncode10x4x16M-8            100          13746566 ns/op        12204.66 MB/s
BenchmarkEncode10x4x16M-4            100          14333118 ns/op        11705.21 MB/s
BenchmarkEncode10x4x16M-2            100          21885522 ns/op        7665.90 MB/s
BenchmarkEncode10x4x16M               30          37515470 ns/op        4472.08 MB/s

WithMaxGoroutines(50) (default)
$ go test -short -bench=kEncode10x4x16M -cpu=8,4,2,1
pkg: github.com/klauspost/reedsolomon
BenchmarkEncode10x4x16M-8             30          65574413 ns/op        2558.50 MB/s
BenchmarkEncode10x4x16M-4             30          47900543 ns/op        3502.51 MB/s
BenchmarkEncode10x4x16M-2             50          33218350 ns/op        5050.59 MB/s
BenchmarkEncode10x4x16M               20          52264020 ns/op        3210.09 MB/s

So there is still some performance to be had with some tweaks.

@templexxx

@klauspost

Thank you for your testing

@templexxx

@rogers0

@klauspost and I do the same job in different ways; we share the same core code, but mine runs on a single goroutine.

So I think it would be hard to merge my code into his.

And as @klauspost said, the code is memory limited, so I don't think there's a need to do that either.

klauspost can do a much better job than me in many ways; I still have a lot to learn from him.

@fwessels
Contributor

We also tested against this project but did not find significant performance differences between the two projects.

@templexxx

@fwessels

Yes, klauspost/reedsolomon runs on multiple cores; mine does not.

That's the main difference.
