zstd: Branchless getBits for amd64 w/o BMI2 #640

Merged
merged 1 commit into from Jul 12, 2022


greatroar
Contributor

@greatroar greatroar commented Jul 8, 2022

This produces the same number of instructions while requiring less code in the generator. Benchmarks on an Intel Core i7-3770K show a tiny speedup:

name                                                        old speed      new speed      delta
Decoder_DecoderSmall/kppkn.gtb.zst-8                         430MB/s ± 1%   437MB/s ± 1%  +1.60%  (p=0.000 n=10+9)
Decoder_DecoderSmall/geo.protodata.zst-8                    1.11GB/s ± 1%  1.13GB/s ± 0%  +1.37%  (p=0.000 n=9+9)
Decoder_DecoderSmall/plrabn12.txt.zst-8                      334MB/s ± 1%   339MB/s ± 1%  +1.41%  (p=0.000 n=9+10)
Decoder_DecoderSmall/lcet10.txt.zst-8                        392MB/s ± 2%   404MB/s ± 1%  +3.05%  (p=0.000 n=10+10)
Decoder_DecoderSmall/asyoulik.txt.zst-8                      355MB/s ± 2%   357MB/s ± 1%    ~     (p=0.315 n=10+9)
Decoder_DecoderSmall/alice29.txt.zst-8                       344MB/s ± 1%   350MB/s ± 1%  +1.69%  (p=0.000 n=10+10)
Decoder_DecoderSmall/html_x_4.zst-8                         2.34GB/s ± 1%  2.37GB/s ± 1%  +1.10%  (p=0.000 n=10+10)
Decoder_DecoderSmall/paper-100k.pdf.zst-8                   3.75GB/s ± 0%  3.76GB/s ± 1%    ~     (p=0.182 n=9+10)
Decoder_DecoderSmall/fireworks.jpeg.zst-8                   8.59GB/s ± 1%  8.58GB/s ± 1%    ~     (p=0.842 n=10+9)
Decoder_DecoderSmall/urls.10K.zst-8                          561MB/s ± 1%   556MB/s ± 1%  -0.82%  (p=0.019 n=10+10)
Decoder_DecoderSmall/html.zst-8                              900MB/s ± 1%   913MB/s ± 1%  +1.42%  (p=0.000 n=10+9)
Decoder_DecoderSmall/comp-data.bin.zst-8                     399MB/s ± 1%   395MB/s ± 1%  -0.99%  (p=0.000 n=10+10)
Decoder_DecodeAll/kppkn.gtb.zst-8                            518MB/s ± 0%   526MB/s ± 0%  +1.52%  (p=0.000 n=10+9)
Decoder_DecodeAll/geo.protodata.zst-8                       1.28GB/s ± 0%  1.27GB/s ± 2%    ~     (p=0.739 n=10+10)
Decoder_DecodeAll/plrabn12.txt.zst-8                         427MB/s ± 1%   433MB/s ± 1%  +1.24%  (p=0.000 n=10+10)
Decoder_DecodeAll/lcet10.txt.zst-8                           480MB/s ± 1%   490MB/s ± 1%  +2.06%  (p=0.000 n=10+10)
Decoder_DecodeAll/asyoulik.txt.zst-8                         435MB/s ± 0%   447MB/s ± 0%  +2.70%  (p=0.000 n=7+9)
Decoder_DecodeAll/alice29.txt.zst-8                          422MB/s ± 0%   438MB/s ± 1%  +3.96%  (p=0.000 n=8+9)
Decoder_DecodeAll/html_x_4.zst-8                            1.60GB/s ± 0%  1.61GB/s ± 0%  +0.99%  (p=0.000 n=9+10)
Decoder_DecodeAll/paper-100k.pdf.zst-8                      4.55GB/s ± 1%  4.44GB/s ± 1%  -2.42%  (p=0.000 n=10+10)
Decoder_DecodeAll/fireworks.jpeg.zst-8                      9.52GB/s ± 1%  9.47GB/s ± 2%    ~     (p=0.143 n=10+10)
Decoder_DecodeAll/urls.10K.zst-8                             678MB/s ± 1%   684MB/s ± 0%  +0.83%  (p=0.000 n=10+10)
Decoder_DecodeAll/html.zst-8                                1.05GB/s ± 0%  1.07GB/s ± 1%  +2.11%  (p=0.000 n=10+10)
Decoder_DecodeAll/comp-data.bin.zst-8                        397MB/s ± 1%   391MB/s ± 1%  -1.37%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-8   437MB/s ± 0%   436MB/s ± 1%  -0.21%  (p=0.025 n=9+9)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-8   448MB/s ± 0%   451MB/s ± 0%  +0.70%  (p=0.000 n=9+9)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-8    478MB/s ± 0%   475MB/s ± 0%  -0.53%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-8      461MB/s ± 0%   470MB/s ± 0%  +2.07%  (p=0.000 n=8+9)
Decoder_DecodeAllFiles/e.txt/fastest-8                      9.62GB/s ± 3%  9.62GB/s ± 2%    ~     (p=1.000 n=10+10)
Decoder_DecodeAllFiles/e.txt/default-8                       391MB/s ± 0%   406MB/s ± 0%  +3.81%  (p=0.000 n=10+8)
Decoder_DecodeAllFiles/e.txt/better-8                        438MB/s ± 0%   448MB/s ± 0%  +2.39%  (p=0.000 n=8+10)
Decoder_DecodeAllFiles/e.txt/best-8                          500MB/s ± 0%   500MB/s ± 0%    ~     (p=0.119 n=9+9)
Decoder_DecodeAllFiles/fse-artifact3.bin/fastest-8          1.07GB/s ± 1%  1.04GB/s ± 1%  -2.61%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/fse-artifact3.bin/default-8          1.21GB/s ± 1%  1.19GB/s ± 1%  -1.33%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/fse-artifact3.bin/better-8            994MB/s ± 0%   990MB/s ± 0%  -0.42%  (p=0.002 n=10+9)
Decoder_DecodeAllFiles/fse-artifact3.bin/best-8              389MB/s ± 0%   381MB/s ± 0%  -2.00%  (p=0.000 n=8+10)
Decoder_DecodeAllFiles/gettysburg.txt/fastest-8              274MB/s ± 1%   274MB/s ± 1%    ~     (p=1.000 n=10+10)
Decoder_DecodeAllFiles/gettysburg.txt/default-8              224MB/s ± 1%   223MB/s ± 1%  -0.64%  (p=0.015 n=10+10)
Decoder_DecodeAllFiles/gettysburg.txt/better-8               228MB/s ± 1%   227MB/s ± 1%  -0.40%  (p=0.041 n=10+10)
Decoder_DecodeAllFiles/gettysburg.txt/best-8                 225MB/s ± 1%   223MB/s ± 0%  -0.52%  (p=0.008 n=10+6)
Decoder_DecodeAllFiles/html.txt/fastest-8                    599MB/s ± 1%   614MB/s ± 1%  +2.41%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/html.txt/default-8                    601MB/s ± 0%   613MB/s ± 0%  +2.01%  (p=0.000 n=8+9)
Decoder_DecodeAllFiles/html.txt/better-8                     626MB/s ± 1%   638MB/s ± 0%  +1.99%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/html.txt/best-8                       601MB/s ± 0%   612MB/s ± 0%  +1.87%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/pi.txt/fastest-8                     9.64GB/s ± 2%  9.66GB/s ± 1%    ~     (p=0.529 n=10+10)
Decoder_DecodeAllFiles/pi.txt/default-8                      390MB/s ± 0%   403MB/s ± 0%  +3.48%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/pi.txt/better-8                       439MB/s ± 0%   451MB/s ± 0%  +2.65%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/pi.txt/best-8                         500MB/s ± 0%   499MB/s ± 0%  -0.27%  (p=0.009 n=7+10)
Decoder_DecodeAllFiles/pngdata.bin/fastest-8                1.70GB/s ± 1%  1.69GB/s ± 1%  -0.63%  (p=0.013 n=10+9)
Decoder_DecodeAllFiles/pngdata.bin/default-8                1.52GB/s ± 1%  1.51GB/s ± 0%  -0.75%  (p=0.000 n=10+9)
Decoder_DecodeAllFiles/pngdata.bin/better-8                 1.92GB/s ± 0%  1.90GB/s ± 0%  -1.02%  (p=0.000 n=10+10)
Decoder_DecodeAllFiles/pngdata.bin/best-8                   1.47GB/s ± 0%  1.46GB/s ± 0%  -0.88%  (p=0.000 n=10+9)
Decoder_DecodeAllFiles/sharnd.out/fastest-8                 9.60GB/s ± 1%  9.67GB/s ± 1%  +0.67%  (p=0.029 n=10+10)
Decoder_DecodeAllFiles/sharnd.out/default-8                 9.65GB/s ± 2%  9.71GB/s ± 1%    ~     (p=0.353 n=10+10)
Decoder_DecodeAllFiles/sharnd.out/better-8                  9.67GB/s ± 1%  9.66GB/s ± 0%    ~     (p=0.549 n=10+9)
Decoder_DecodeAllFiles/sharnd.out/best-8                    9.70GB/s ± 1%  9.61GB/s ± 0%  -0.91%  (p=0.010 n=10+9)
[Geo mean]                                                   935MB/s        940MB/s       +0.57%
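
For readers unfamiliar with the idea in the title, here is a minimal Go sketch of a branching versus a branchless bit extraction. It is an illustration only: the code touched by this PR is generated amd64 assembly, and the function names, the rotate-and-mask formulation, and the assumption that `n` and `bitsRead` stay in 0..63 are all hypothetical and not taken from the patch.

```go
package main

import (
	"fmt"
	"math/bits"
)

// getBitsBranchy (hypothetical name) models a branching variant.
// amd64 shift instructions use only the low 6 bits of the count, so a
// right shift by 64 would leave the value unchanged; n == 0 therefore
// needs its own path. The "& 63" masks mimic that hardware behaviour.
func getBitsBranchy(value uint64, bitsRead, n uint8) uint64 {
	if n == 0 {
		return 0
	}
	return (value << (bitsRead & 63)) >> ((64 - n) & 63)
}

// getBitsBranchless (hypothetical name) rotates the top n unread bits
// down to the bottom of the word and masks them out. For n == 0 the
// mask is zero, so no special case is needed.
func getBitsBranchless(value uint64, bitsRead, n uint8) uint64 {
	v := bits.RotateLeft64(value<<(bitsRead&63), int(n))
	return v & (uint64(1)<<n - 1)
}

func main() {
	const value = 0xF0E1D2C3B4A59687
	for _, n := range []uint8{0, 1, 4, 9, 32} {
		a := getBitsBranchy(value, 8, n)
		b := getBitsBranchless(value, 8, n)
		fmt.Printf("n=%2d branchy=%#x branchless=%#x\n", n, a, b)
	}
}
```

Both functions should agree for every `n` in 0..63; whether the branchless form is actually faster depends on the CPU and the surrounding code, which is what the benchmarks in this thread try to measure.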

@klauspost
Owner

klauspost commented Jul 9, 2022

Nice! I will do a few tests and merge if no problems show up.

@klauspost
Owner

klauspost commented Jul 11, 2022

I remember adding this, but abandoning it, since it wasn't a clear result.

I reran the test with this patch, and the results are up and down:

benchmark                                                                                           old ns/op     new ns/op     delta
Benchmark_seqdec_decodeNoBMI/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-32        91683         88854         -3.09%
Benchmark_seqdec_decodeNoBMI/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-32       91226         84651         -7.21%
Benchmark_seqdec_decodeNoBMI/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-32               100203        105549        +5.34%
Benchmark_seqdec_decodeNoBMI/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-32        9974          10640         +6.68%
Benchmark_seqdec_decodeNoBMI/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-32        23991         25967         +8.24%
Benchmark_seqdec_decodeNoBMI/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-32         64748         60621         -6.37%
Benchmark_seqdec_decodeNoBMI/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-32                  6634          7498          +13.02%
Benchmark_seqdec_decodeNoBMI/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-32     120044        114711        -4.44%
Benchmark_seqdec_decodeNoBMI/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-32                   54.0          53.8          -0.26%
Benchmark_seqdec_decodeNoBMI/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-32                    604           636           +5.30%
Benchmark_seqdec_decodeNoBMI/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-32                6532          7049          +7.91%
Benchmark_seqdec_decodeNoBMI/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-32                   19664         21733         +10.52%
Benchmark_seqdec_decodeNoBMI/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-32              29075         30482         +4.84%
Benchmark_seqdec_decodeNoBMI/n-8028-lits-4568-prev-917-65-920-win-8388608.blk-32                    57341         58764         +2.48%
Benchmark_seqdec_decode/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-32             69308         69388         +0.12%
Benchmark_seqdec_decode/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-32            67437         63946         -5.18%
Benchmark_seqdec_decode/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-32                    82492         82034         -0.56%
Benchmark_seqdec_decode/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-32             8324          8298          -0.31%
Benchmark_seqdec_decode/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-32             20671         20719         +0.23%
Benchmark_seqdec_decode/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-32              47999         48290         +0.61%
Benchmark_seqdec_decode/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-32                       6041          5955          -1.42%
Benchmark_seqdec_decode/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-32          92246         92660         +0.45%
Benchmark_seqdec_decode/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-32                        59.3          59.4          +0.12%
Benchmark_seqdec_decode/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-32                         517           516           -0.15%
Benchmark_seqdec_decode/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-32                     5658          5707          +0.87%
Benchmark_seqdec_decode/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-32                        17531         17585         +0.31%
Benchmark_seqdec_decode/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-32                   25750         24288         -5.68%
Benchmark_seqdec_decode/n-8028-lits-4568-prev-917-65-920-win-8388608.blk-32                         46935         46549         -0.82%
Benchmark_seqdec_decodeSync/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-32         196428        195331        -0.56%
Benchmark_seqdec_decodeSync/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-32        180655        180105        -0.30%
Benchmark_seqdec_decodeSync/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-32                126889        126775        -0.09%
Benchmark_seqdec_decodeSync/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-32         22715         23076         +1.59%
Benchmark_seqdec_decodeSync/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-32         61385         60102         -2.09%
Benchmark_seqdec_decodeSync/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-32          127740        126585        -0.90%
Benchmark_seqdec_decodeSync/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-32                   11595         11631         +0.31%
Benchmark_seqdec_decodeSync/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-32      150081        150425        +0.23%
Benchmark_seqdec_decodeSync/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-32                    3978          3969          -0.23%
Benchmark_seqdec_decodeSync/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-32                     4137          4118          -0.46%
Benchmark_seqdec_decodeSync/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-32                 10638         10838         +1.88%
Benchmark_seqdec_decodeSync/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-32                    28673         28375         -1.04%
Benchmark_seqdec_decodeSync/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-32               57922         58084         +0.28%
Benchmark_seqdec_decodeSync/n-8028-lits-4568-prev-917-65-920-win-8388608.blk-32                     110068        110146        +0.07%

With BMI it is a win, but without BMI it seems worse in quite a few cases.

Maybe using this for BMI and keeping the old method for plain x64 would be best?

@greatroar
Contributor Author

greatroar commented Jul 11, 2022

This patch doesn't change the BMI2 path. So using this for BMI2 is just dropping the patch :)

I see similar results on my machine: better on some benchmarks, worse on others. I find it hard to tell how these benchmarks relate to the more end-to-end Decode* benchmarks. Also, code size still goes down.

_seqdec_decode/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-8            150µs ± 1%     145µs ± 1%   -3.39%  (p=0.008 n=5+5)
_seqdec_decode/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-8           151µs ± 1%     137µs ± 1%   -9.46%  (p=0.008 n=5+5)
_seqdec_decode/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-8                   177µs ± 3%     175µs ± 1%     ~     (p=0.222 n=5+5)
_seqdec_decode/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-8           17.4µs ± 2%    17.8µs ± 1%   +1.97%  (p=0.032 n=5+5)
_seqdec_decode/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-8           41.4µs ± 1%    42.6µs ± 1%   +2.74%  (p=0.008 n=5+5)
_seqdec_decode/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-8             108µs ± 1%     101µs ± 2%   -6.31%  (p=0.008 n=5+5)
_seqdec_decode/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-8                     11.5µs ± 1%    12.3µs ± 2%   +6.83%  (p=0.008 n=5+5)
_seqdec_decode/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-8         198µs ± 1%     188µs ± 2%   -5.09%  (p=0.008 n=5+5)
_seqdec_decode/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-8                      78.5ns ± 1%    78.9ns ± 2%     ~     (p=0.548 n=5+5)
_seqdec_decode/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-8                       1.04µs ± 3%    1.07µs ± 1%     ~     (p=0.063 n=5+5)
_seqdec_decode/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-8                   11.3µs ± 1%    11.7µs ± 2%   +3.60%  (p=0.008 n=5+5)
_seqdec_decode/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-8                      36.8µs ± 2%    35.5µs ± 1%   -3.42%  (p=0.008 n=5+5)
_seqdec_decode/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-8                 53.4µs ± 2%    50.4µs ± 1%   -5.68%  (p=0.008 n=5+5)
_seqdec_decode/n-8028-lits-4568-prev-917-65-920-win-8388608.blk-8                        103µs ± 1%      96µs ± 1%   -6.45%  (p=0.008 n=5+5)
_seqdec_decodeSync/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-8        300µs ± 3%     302µs ± 1%     ~     (p=1.000 n=5+5)
_seqdec_decodeSync/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-8       286µs ± 1%     278µs ± 2%   -2.84%  (p=0.008 n=5+5)
_seqdec_decodeSync/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-8               273µs ± 1%     253µs ± 3%   -7.24%  (p=0.008 n=5+5)
_seqdec_decodeSync/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-8       39.7µs ± 3%    40.5µs ± 1%     ~     (p=0.151 n=5+5)
_seqdec_decodeSync/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-8       88.8µs ± 1%    89.6µs ± 1%     ~     (p=0.079 n=5+5)
_seqdec_decodeSync/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-8         202µs ± 2%     198µs ± 1%   -2.06%  (p=0.032 n=5+5)
_seqdec_decodeSync/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-8                 20.6µs ± 2%    19.9µs ± 2%   -3.67%  (p=0.016 n=5+5)
_seqdec_decodeSync/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-8     272µs ± 2%     262µs ± 2%   -3.86%  (p=0.008 n=5+5)
_seqdec_decodeSync/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-8                  5.60µs ± 2%    5.74µs ± 2%   +2.62%  (p=0.032 n=5+5)
_seqdec_decodeSync/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-8                   6.73µs ± 2%    7.21µs ± 2%   +7.15%  (p=0.008 n=5+5)
_seqdec_decodeSync/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-8               19.1µs ± 2%    20.3µs ± 2%   +6.52%  (p=0.008 n=5+5)
_seqdec_decodeSync/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-8                  59.4µs ± 1%    57.7µs ± 1%   -2.78%  (p=0.008 n=5+5)
_seqdec_decodeSync/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-8             94.0µs ± 1%    91.1µs ± 1%   -3.02%  (p=0.008 n=5+5)
_seqdec_decodeSync/n-8028-lits-4568-prev-917-65-920-win-8388608.blk-8                    179µs ± 1%     173µs ± 1%   -3.47%  (p=0.008 n=5+5)

@klauspost
Owner

klauspost commented Jul 12, 2022

I wonder if something else is affecting the benchmarks, since all of the ones you posted are with BMI.

Only seqdec_decodeNoBMI uses the non-BMI code, which is why I included it.

@greatroar
Contributor Author

greatroar commented Jul 12, 2022

I should maybe have said this earlier, but my CPU does not have the BMI2 instructions.

@klauspost
Owner

klauspost commented Jul 12, 2022

@greatroar Ah, ok :) Actually that is great, since a speedup on your CPU is much more relevant than one on mine.

seqdec_decode microbenchmarks that particular piece of code, so differences are typically bigger there than in a full decode.

@klauspost klauspost merged commit 9a048c1 into klauspost:master Jul 12, 2022
13 checks passed