
There's a performance improved fork of this project #69

Closed
rogers0 opened this issue Sep 24, 2017 · 8 comments

Comments

@rogers0

rogers0 commented Sep 24, 2017

I'm the Debian package maintainer of your reedsolomon project.
It has come to my attention that there's a fork of your project that claims a big performance boost:

I asked the fork's author to send the improvements upstream as a patch, but he/she doesn't seem interested.

I looked at both projects and found the differences too extensive for me to turn into patches myself.
So maybe you're interested in those improvements and could take a look at the project? Thanks!

@klauspost
Owner

Honestly, I don't think the difference is that big. You can tweak the benchmarks in this package to operate on data in caches.

Running on my local desktop (Core i7-2600 @ 3.4 GHz):

klauspost:
BenchmarkEncode10x4x16M-4                     30          41895236 ns/op        4004.56 MB/s
BenchmarkEncode10x4x16M-2                     30          45082120 ns/op        3721.48 MB/s
BenchmarkEncode10x4x16M                       20          72077065 ns/op        2327.68 MB/s

templexxx:
BenchmarkEnc/#01/10+4_16777216-8              20          63102500 ns/op        2658.72 MB/s

So single-core performance is a bit better, but nothing mind-blowing.

I looked at the code, and it is pretty much the same. It has a bit more unrolling, but the code is memory limited anyway. I have privately done an unrolled version, but it didn't make any practical difference, so I just dropped it to keep complexity down.
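For reference, the inner loop both implementations share is essentially a table-driven GF(2^8) multiply-accumulate over every byte of a shard, which is why it ends up memory-bound rather than compute-bound. A rough pure-Go sketch of that idea (function names are illustrative, not the package's actual internals; the real assembly uses SSSE3/AVX2 nibble-table lookups, but it still touches every byte of both shards):

```go
package main

import "fmt"

// gfMul multiplies two elements of GF(2^8) using the polynomial 0x11D
// (x^8 + x^4 + x^3 + x^2 + 1) commonly used for Reed-Solomon coding.
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 != 0 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1D // low byte of 0x11D after the shifted-out bit
		}
		b >>= 1
	}
	return p
}

// galMulAdd is the memory-bound core operation: for one matrix
// coefficient c, fold an input shard into an output shard. Unrolling
// this loop helps little because DRAM bandwidth is the ceiling.
func galMulAdd(c byte, in, out []byte) {
	var mulTable [256]byte
	for i := range mulTable {
		mulTable[i] = gfMul(c, byte(i))
	}
	for i, v := range in {
		out[i] ^= mulTable[v]
	}
}

func main() {
	in := []byte{1, 2, 3, 4}
	out := make([]byte, 4)
	galMulAdd(2, in, out)
	fmt.Println(out) // [2 4 6 8]
}
```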

Adding Cauchy matrix would be neat.

@klauspost
Owner

Added #70 with Cauchy matrix.
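For readers unfamiliar with the construction: a Cauchy matrix has entries 1/(x_i + y_j) for distinct x's and y's, and every square submatrix of it is invertible, which is exactly the property erasure coding needs for reconstruction. A hedged GF(2^8) sketch of the textbook construction (not necessarily the exact matrix #70 builds; `cauchyMatrix` and the index choice are illustrative):

```go
package main

import "fmt"

// gfMul multiplies in GF(2^8) with the polynomial 0x11D.
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 != 0 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1D
		}
		b >>= 1
	}
	return p
}

// gfInv finds the multiplicative inverse by brute force; fine for
// building a one-off 255-entry table.
func gfInv(a byte) byte {
	for b := 1; b < 256; b++ {
		if gfMul(a, byte(b)) == 1 {
			return byte(b)
		}
	}
	return 0 // 0 has no inverse
}

// cauchyMatrix returns a rows x cols matrix with m[r][c] = 1/(x_r + y_c),
// choosing x_r = r and y_c = rows + c. Addition in GF(2^8) is XOR, and
// the two index sets are disjoint, so x_r ^ y_c is never zero; requires
// rows + cols <= 256.
func cauchyMatrix(rows, cols int) [][]byte {
	m := make([][]byte, rows)
	for r := range m {
		m[r] = make([]byte, cols)
		for c := range m[r] {
			m[r][c] = gfInv(byte(r) ^ byte(rows+c))
		}
	}
	return m
}

func main() {
	m := cauchyMatrix(4, 10) // 4 parity rows over 10 data shards
	fmt.Println(len(m), len(m[0])) // 4 10
}
```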

@templexxx

@klauspost

Thank you for your contribution, you really helped me a lot 👍

Yes, the code is memory limited, and loop unrolling can't help much (though as a newcomer, it was good practice for me :D).

I added some cache-friendliness tricks, so performance improves when the shard size is big, when the size isn't divisible by 16, or when the size is very small (< 4KB).

For example: I split big shards into pieces of about 16KB to fit the L1 data cache.

So I think BenchmarkEnc/#1/10+4_16777216-8 should be much faster than 2658.72 MB/s; maybe something went wrong there? It was much slower than I expected.

I hope I'm not bothering you.

Thanks

@klauspost
Owner

It is probably because I am on a system without AVX2. Here is a benchmark with AVX2:

I did some tests, and it seems like the "maximum goroutines" setting could benefit from some adjustment:

templexxx:
BenchmarkEnc/#01/10+4_16777216                50          30111190 ns/op        5571.75 MB/s

WithMaxGoroutines(512)
$ go test -short -bench=kEncode10x4x16M -cpu=8,4,2,1
pkg: github.com/klauspost/reedsolomon
BenchmarkEncode10x4x16M-8            100          13746566 ns/op        12204.66 MB/s
BenchmarkEncode10x4x16M-4            100          14333118 ns/op        11705.21 MB/s
BenchmarkEncode10x4x16M-2            100          21885522 ns/op        7665.90 MB/s
BenchmarkEncode10x4x16M               30          37515470 ns/op        4472.08 MB/s

WithMaxGoroutines(50) (default)
$ go test -short -bench=kEncode10x4x16M -cpu=8,4,2,1
pkg: github.com/klauspost/reedsolomon
BenchmarkEncode10x4x16M-8             30          65574413 ns/op        2558.50 MB/s
BenchmarkEncode10x4x16M-4             30          47900543 ns/op        3502.51 MB/s
BenchmarkEncode10x4x16M-2             50          33218350 ns/op        5050.59 MB/s
BenchmarkEncode10x4x16M               20          52264020 ns/op        3210.09 MB/s

So there is still some performance to be had with some tweaks.

@templexxx

@klauspost

Thank you for your testing

@templexxx

@rogers0

@klauspost and I do the same job in different ways; we share the same core code, but mine runs on a single goroutine.

So I think it would be hard to merge my code into his.

And as @klauspost said, the code is memory limited, so I don't think there's a need to do that either.

klauspost can do a much better job than me in many ways; I still have a lot to learn from him.

@fwessels
Contributor

We also tested against this project but did not find significant performance differences between the two projects.

@templexxx

@fwessels

Yes, klauspost/reedsolomon runs on multiple cores; mine does not.

That's the main difference.
