There's a performance improved fork of this project #69
Comments
Honestly, I don't think the difference is that big. You can tweak the benchmarks in this package to operate on data that stays in the caches. Running on my local desktop (Core i7-2600 @ 3.4 GHz):
So single-core performance is a bit better, but nothing mind-blowing. I looked at the code, and it is pretty much the same. It has a bit more unrolling, but the code is memory-limited anyway. I have privately written an unrolled version, but it didn't make any practical difference, so I dropped it to keep complexity down. Adding a Cauchy matrix would be neat.
Added #70 with a Cauchy matrix.
Thank you for your contribution, you really helped me a lot 👍
Yes, the code is memory-limited, and loop unrolling can't help much (but as a beginner, it was good practice for me :D).
I used some tricks to make the code more cache-friendly, so performance improves when the shard size is big, when the size is not divisible by 16, or when the size is very small (< 4KB). For example, I split each shard into pieces of about 16KB to fit the L1 data cache, which helps with big shards.
So I think BenchmarkEnc/#1/10+4_16777216-8 should be much faster than 2658.72 MB/s. Maybe something went wrong there? It was much slower than I expected.
I hope I'm not bothering you.
Thanks
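The idea of splitting shards into L1-sized pieces can be illustrated with a minimal sketch. This is not code from either project: `blockSize`, `xorInto`, and `parityBlocked` are hypothetical names, and a plain XOR stands in for the Galois-field multiply-accumulate kernel that a real Reed-Solomon encoder would use. It only shows the loop order: one output chunk is kept hot in cache while every data shard is folded into it, and the tail chunk handles sizes that are not a multiple of the block size.

```go
package main

import "fmt"

// blockSize is an assumed chunk size (16 KiB), picked so that one output
// chunk plus one input chunk fit in a typical 32 KiB L1 data cache.
const blockSize = 16 << 10

// xorInto XORs src into dst. It is a stand-in for the Galois-field
// multiply-accumulate kernel of a real Reed-Solomon encoder.
func xorInto(dst, src []byte) {
	for i := range dst {
		dst[i] ^= src[i]
	}
}

// parityBlocked computes a single XOR parity shard over equally sized data
// shards. It walks the shards in blockSize chunks so that the parity chunk
// being accumulated stays cache-resident while each data shard is applied.
func parityBlocked(data [][]byte, size int) []byte {
	parity := make([]byte, size)
	for start := 0; start < size; start += blockSize {
		end := start + blockSize
		if end > size {
			end = size // tail chunk: size need not be a multiple of blockSize
		}
		for _, shard := range data {
			xorInto(parity[start:end], shard[start:end])
		}
	}
	return parity
}

func main() {
	const size = 40000 // deliberately not a multiple of blockSize
	data := make([][]byte, 10)
	for i := range data {
		data[i] = make([]byte, size)
		for j := range data[i] {
			data[i][j] = byte(i*31 + j)
		}
	}
	parity := parityBlocked(data, size)
	fmt.Println("parity shard length:", len(parity))
}
```

Whether this ordering helps in practice depends on the shard size and the CPU's cache sizes, so it is worth benchmarking on the target machine before relying on it.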
It is probably because I am on a system without AVX2. Here is a benchmark with AVX2:
I did some tests, and it seems like the "maximum goroutines" could benefit from some adjustments:
So there is still some performance to be had with some tweaks.
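As a rough illustration of what tuning a goroutine limit means, here is a minimal sketch that splits the byte range of a shard across a configurable number of goroutines. It is not the library's implementation: `encodeRange` and `encodeParallel` are hypothetical names, and XOR again stands in for the real encoding kernel. Each goroutine writes to a disjoint range of the output, so no locking is needed.

```go
package main

import (
	"fmt"
	"sync"
)

// encodeRange is a placeholder kernel: it XORs every data shard into the
// parity shard over the byte range [start, end).
func encodeRange(parity []byte, data [][]byte, start, end int) {
	for _, shard := range data {
		for i := start; i < end; i++ {
			parity[i] ^= shard[i]
		}
	}
}

// encodeParallel splits [0, size) into at most maxGoroutines contiguous
// ranges and processes each range in its own goroutine.
func encodeParallel(data [][]byte, size, maxGoroutines int) []byte {
	parity := make([]byte, size)
	if maxGoroutines < 1 {
		maxGoroutines = 1
	}
	// Bytes per goroutine, rounded up so the ranges cover the whole shard.
	per := (size + maxGoroutines - 1) / maxGoroutines
	var wg sync.WaitGroup
	for start := 0; start < size; start += per {
		end := start + per
		if end > size {
			end = size
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			encodeRange(parity, data, start, end)
		}(start, end)
	}
	wg.Wait()
	return parity
}

func main() {
	const size = 1 << 20
	data := make([][]byte, 10)
	for i := range data {
		data[i] = make([]byte, size)
		for j := range data[i] {
			data[i][j] = byte(i + j)
		}
	}
	single := encodeParallel(data, size, 1)
	multi := encodeParallel(data, size, 8)
	same := true
	for i := range single {
		if single[i] != multi[i] {
			same = false
			break
		}
	}
	fmt.Println("results match:", same)
}
```

Because the work is memory-limited, more goroutines do not necessarily mean more throughput; the best limit is likely to depend on shard size and memory bandwidth, which is why it is worth measuring several values.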
Thank you for your testing.
@klauspost and I do the same job in different ways, and we share the same core code. But I made mine work on a single goroutine, so I think it's hard to merge my code into his. And as @klauspost said, the code is memory-limited, so I think there is no need to do that either. I think klauspost can do a much better job than me in many ways; I still have a lot to learn from him.
We also tested against this project but did not find significant performance differences between the two projects.
Yes, klauspost/reedsolomon runs on multiple cores and mine does not. That's the main difference.
I'm the Debian package maintainer of your reedsolomon project.
A fork of your project that claims a big performance boost has caught my attention:
I asked the fork author to send the improvement patches to you upstream, but he/she does not seem interested.
I looked at the two projects and found that the differences are too large for me to produce such patches myself.
So maybe you're interested in those improvements and could take a look at the project? Thanks!