Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

huff0: decompress directly into output #577

Merged

Conversation

WojciechMula
Copy link
Collaborator

@WojciechMula WojciechMula commented May 5, 2022

Solves #576.
Benchmarks on Skylake and Ice Lake machines didn't show any boost. (On my very old laptop there's an improvement). It's reasonable, as perf report for tests shows that 95% of the time is spent in decompress routines. But benchmarks of zstd show some improvement.

I'm benchmarking on AWS machines and they are not that reliable for microbenchmarks. They can show a few per cent differences for two subsequent runs of the same code.

WojciechMula added 2 commits May 5, 2022
The Go implementation uses an auxiliary buffer to avoid extensive bound checks.
In the asm implementation we know in advance the output buffer is big enough,
so it's possible to write directly to that buffer.
@klauspost
Copy link
Owner

klauspost commented May 6, 2022

@WojciechMula We can just abandon #743

I fixed out-of-bounds writes, since we didn't check if we exceeded the write buffer.
I re-added the separate fill routine from the other PR.
bufferOrigin can be a register in decompress4x_main_loop_amd64.
exhausted doesn't use high/low bytes anymore. This can be slow on older CPUs.
Fixed build tags

Please check out/upgrade to: https://gist.github.com/klauspost/306ed219fb5e275d7c9abb87c9a7b142

@WojciechMula
Copy link
Collaborator Author

WojciechMula commented May 6, 2022

@klauspost I agree, this change is quite small, there's no point in having to PRs.

Thanks for the fixes, will incorporate them.

(Added you as a collabolator to my fork, so you don't have to play with gists. Sorry, I should have done it earlier.)

@WojciechMula
Copy link
Collaborator Author

WojciechMula commented May 6, 2022

Regarding the build tags: seems that avo invoked for Go 1.18 does not include the old build tags. Yesterday I tried to figure it out in the avo sources but didn't find where it's checked. I'll recheck that in the evening.

huff0/decompress_amd64.go Outdated Show resolved Hide resolved
@klauspost
Copy link
Owner

klauspost commented May 6, 2022

Aside from the crash from above fuzz tests are running fine 👍🏼

@klauspost
Copy link
Owner

klauspost commented May 6, 2022

FWIW I tried reading 56 bits and doing decode 4 and 6 values for decompress4x_main_loop_amd64 and decompress4x_8b_main_loop_amd64 respectively.

Reading 6 instead of 4 values is significantly slower (~20% deg).

The only gain is BenchmarkDecompress4XNoTable/twain/262143-32 699.10 814.60 1.17x and BenchmarkDecompress4XNoTable/gettysburg/262143-32 799.28 822.87 1.03x reading 4 values instead of 2 per loop.

Copy link
Owner

@klauspost klauspost left a comment

benchmark                                                   old ns/op     new ns/op     delta
BenchmarkDecompress4XNoTable/digits/100-32                  334           336           +0.66%
BenchmarkDecompress4XNoTable/digits/10000-32                10835         9562          -11.75%
BenchmarkDecompress4XNoTable/digits/262143-32               303585        270811        -10.80%
BenchmarkDecompress4XNoTable/gettysburg/100-32              285           286           +0.56%
BenchmarkDecompress4XNoTable/gettysburg/10000-32            11393         10268         -9.87%
BenchmarkDecompress4XNoTable/gettysburg/262143-32           327973        289561        -11.71%
BenchmarkDecompress4XNoTable/twain/100-32                   331           330           -0.27%
BenchmarkDecompress4XNoTable/twain/10000-32                 11458         10235         -10.67%
BenchmarkDecompress4XNoTable/twain/262143-32                374970        345400        -7.89%
BenchmarkDecompress4XNoTable/low-ent.10k/100-32             367           371           +1.01%
BenchmarkDecompress4XNoTable/low-ent.10k/10000-32           10812         9398          -13.08%
BenchmarkDecompress4XNoTable/low-ent.10k/262143-32          256684        221666        -13.64%
BenchmarkDecompress4XNoTable/superlow-ent-10k/262143-32     256839        221322        -13.83%
BenchmarkDecompress4XNoTable/case1/100-32                   318           322           +1.23%
BenchmarkDecompress4XNoTable/case1/10000-32                 10803         9562          -11.49%
BenchmarkDecompress4XNoTable/case1/262143-32                277377        242147        -12.70%
BenchmarkDecompress4XNoTable/case2/100-32                   345           340           -1.62%
BenchmarkDecompress4XNoTable/case2/10000-32                 10659         9473          -11.13%
BenchmarkDecompress4XNoTable/case2/262143-32                268723        236376        -12.04%
BenchmarkDecompress4XNoTable/case3/100-32                   333           336           +0.99%
BenchmarkDecompress4XNoTable/case3/10000-32                 10737         9357          -12.85%
BenchmarkDecompress4XNoTable/case3/262143-32                272268        239011        -12.21%
BenchmarkDecompress4XNoTable/pngdata.001/100-32             361           350           -2.99%
BenchmarkDecompress4XNoTable/pngdata.001/10000-32           11583         10740         -7.28%
BenchmarkDecompress4XNoTable/pngdata.001/262143-32          306257        279850        -8.62%
BenchmarkDecompress4XNoTable/normcount2/100-32              287           289           +0.80%
BenchmarkDecompress4XNoTable/normcount2/10000-32            10832         9696          -10.49%
BenchmarkDecompress4XNoTable/normcount2/262143-32           279908        247688        -11.51%
BenchmarkDecompress4XNoTableTableLog8/digits-32             107990        96969         -10.21%

huff0/_generate/gen.go Outdated Show resolved Hide resolved
huff0/_generate/gen.go Outdated Show resolved Hide resolved
huff0/_generate/gen.go Outdated Show resolved Hide resolved
Co-authored-by: Klaus Post <klauspost@gmail.com>
@WojciechMula
Copy link
Collaborator Author

WojciechMula commented May 6, 2022

Your benchmarks look really good! And match expectations... I'm still puzzled by the regressions observed on my Ice Lake machine.

WojciechMula and others added 8 commits May 6, 2022
Co-authored-by: Klaus Post <klauspost@gmail.com>
Co-authored-by: Klaus Post <klauspost@gmail.com>
Co-authored-by: Klaus Post <klauspost@gmail.com>
Presence of this tag depends on the Go version on which `go generate` was run,
see mmcloughlin/avo#183.
@WojciechMula
Copy link
Collaborator Author

WojciechMula commented May 6, 2022

Regarding the build tags: seems that avo invoked for Go 1.18 does not include the old build tags. Yesterday I tried to figure it out in the avo sources but didn't find where it's checked. I'll recheck that in the evening.

@klauspost mmcloughlin/avo#183. It would be great if these settings may be overridden in runtime.

@klauspost klauspost merged commit e7c028f into klauspost:master May 7, 2022
12 checks passed
@mmcloughlin
Copy link
Contributor

mmcloughlin commented May 8, 2022

Can you add this to the generate workflow in Github actions?

@klauspost
Copy link
Owner

klauspost commented May 9, 2022

Can you add this to the generate workflow in Github actions?

Adding in #582

@WojciechMula WojciechMula deleted the huff0-decompress-directly-into-output branch May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants