cmd/compile: loop invariant code motion #63670

Open
y1yang0 opened this issue Oct 22, 2023 · 12 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance

@y1yang0
Contributor

y1yang0 commented Oct 22, 2023

Hi, several months ago I submitted a patch implementing LICM (#59194), but the performance improvement in that patch was not compelling. After some research, we think we need to hoist high-yield instructions (Load/Store/etc.) to achieve a positive performance improvement. Since these high-yield instructions usually cannot be executed speculatively, we may need to implement a full LICM with the following dependencies:

  • LICM: hoist loads, stores, nil checks, etc.
    • Loop Rotation: transform while loops into do-while loops, which creates a home for hoistable values.
      • LCSSA: Loop-Closed SSA form, which ensures that values defined inside a loop are used outside it only through phis placed in the loop's exit blocks. This significantly simplifies the implementation of Loop Rotation and is the basis for future loop optimizations such as loop unswitching and loop vectorization.
      • Def-Use Utilities
    • Type-based Alias Analysis: establishes that a load and a store cannot point to the same location, so that Load/Store/etc. can be hoisted safely.
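As a rough, source-level illustration of why LICM depends on Loop Rotation and alias analysis, here is a hand-written Go sketch of what the two transformations might do to a simple loop. The real passes would work on SSA, not source, and `scaleWhile`/`scaleRotated` are hypothetical names. The load of `*scale` cannot simply be moved above a while loop: if the loop runs zero times and `scale` is nil, the hoisted load would fault where the original did not. After rotation, the guard provides a place that executes only when the body runs at least once, so the load can go there. Moving it above the stores to `dst[i]` also assumes those stores cannot modify `*scale`; in safe Go an `int64` slice element and an `*int32` cannot overlap, which is the kind of fact type-based alias analysis is meant to establish.

```go
package main

import "fmt"

// scaleWhile is the loop in its original "while" shape: the condition is
// tested before every iteration, and *scale is loaded (and nil-checked)
// inside the body on each trip.
func scaleWhile(dst, src []int64, scale *int32) {
	i := 0
	for i < len(src) {
		dst[i] = src[i] * int64(*scale)
		i++
	}
}

// scaleRotated is a hand-written sketch of the same loop after loop rotation
// plus LICM. The guard performs the loop test once up front; inside it the
// body is known to run at least once, so loading *scale there does not add a
// nil dereference that the original loop would not also perform.
// Hoisting also assumes the int64 stores to dst[i] cannot modify the int32
// at *scale (a type-based alias analysis style assumption).
func scaleRotated(dst, src []int64, scale *int32) {
	i := 0
	if i < len(src) {
		// Hoisted, loop-invariant load.
		s := int64(*scale)
		// Do-while shape: body first, exit test at the bottom.
		for {
			dst[i] = src[i] * s
			i++
			if i >= len(src) {
				break
			}
		}
	}
}

func main() {
	src := []int64{1, 2, 3}
	k := int32(2)
	a := make([]int64, len(src))
	b := make([]int64, len(src))
	scaleWhile(a, src, &k)
	scaleRotated(b, src, &k)
	fmt.Println(a, b) // [2 4 6] [2 4 6]

	// With empty input, neither version dereferences the nil pointer.
	scaleWhile(nil, nil, nil)
	scaleRotated(nil, nil, nil)
}
```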

Overall, LCSSA opens the door to future loop optimizations, and LICM is expected to bring attractive performance improvements (I'll update this later). The current work is largely complete; if you're interested, you can check out this POC.

Before submitting the complete PR, I'm looking forward to hearing some comments and suggestions from the community about it. Thank you in advance!

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Oct 22, 2023
@y1yang0 y1yang0 closed this as completed Oct 22, 2023
@y1yang0 y1yang0 reopened this Oct 22, 2023
@dr2chase
Contributor

Do you have a favorite reference for LCSSA? I understand what it's about from reading your patch (and I can see why we would want it) but there are also corner cases (multiple exits to different blocks, exits varying in loop level of target) and it would be nice to see that discussed. I tried searching, but mostly got references to LLVM and GCC internals.

Also, how far are you into a performance evaluation for this? Experience in the past (I also tried this) was uncompelling on x86, better but not inspiring for arm64 and ppc64. But things are changing; I'm working on (internal use only) pure and const calls that would allow commoning, hoisting, and dead code elimination, and arm64 is used much more now in data centers than five years ago.

@dr2chase dr2chase added NeedsFix The path to resolution is known, but the work has not been done. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 23, 2023
@gopherbot gopherbot removed the NeedsFix The path to resolution is known, but the work has not been done. label Oct 23, 2023
@y1yang0
Contributor Author

y1yang0 commented Oct 24, 2023

> Do you have a favorite reference for LCSSA?

LLVM's documentation is a good reference. Maybe I can write some documentation later, either in go/doc or somewhere else.

> but there are also corner cases (multiple exits to different blocks, exits varying in loop level of target) and it would be nice to see that discussed.

Yes! This is the most challenging part.

[Image: CFG diagrams of loop-exit cases 1 to 4]

  • Case 1: there is only one exit block, E1, and E1 dominates the use block, so we insert a proxy phi at E1.
  • Case 2: the exits E1 and E2 dominate all predecessors of the use block, so we insert proxy phis p1 and p2 at E1 and E2 respectively, and insert another proxy phi p3 at the use block to merge p1 and p2.
  • Case 3: the use block is reachable from E1 and E2, but not from E3. This is hard. One idea is to start from all loop exits (including inner-loop exits), follow their dominance frontiers, and see whether we reach the use block; e.g. if E1 and E2 have the same dominance frontier, we insert a proxy phi there. Because this is hard, I'm giving up on this case for now.
  • Case 4: same as case 3.
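The case analysis above can be sketched roughly as follows. This is a simplified illustration, not the POC's code: `Block`, `dominates`, and `proxyPhiSites` are hypothetical names, dominators are assumed to be precomputed as an immediate-dominator link, and case 2 is interpreted as "every predecessor of the use block is dominated by some exit, and every exit dominates some predecessor". Anything else is reported as unsupported, mirroring the decision to give up on cases 3 and 4.

```go
package main

import "fmt"

// Block is a hypothetical, minimal CFG node used only for this sketch; it is
// not cmd/compile's ssa.Block. Idom is the immediate dominator (nil for the
// entry block) and is assumed to be precomputed.
type Block struct {
	ID    int
	Preds []*Block
	Idom  *Block
}

// dominates reports whether a dominates b by walking up b's dominator tree.
// Every block dominates itself.
func dominates(a, b *Block) bool {
	for x := b; x != nil; x = x.Idom {
		if x == a {
			return true
		}
	}
	return false
}

// proxyPhiSites sketches where proxy phis go for a value defined in a loop
// and used in block use outside it. It returns the exits that receive a proxy
// phi, whether an extra phi in use must merge them, and ok=false for the
// cases this sketch (like the proposal) gives up on.
func proxyPhiSites(exits []*Block, use *Block) (sites []*Block, mergeAtUse, ok bool) {
	// Case 1: a single exit that dominates the use block.
	if len(exits) == 1 && dominates(exits[0], use) {
		return exits, false, true
	}
	// Case 2: each predecessor of use is dominated by some exit, and each
	// exit dominates at least one predecessor of use.
	covers := make(map[*Block]bool)
	for _, p := range use.Preds {
		found := false
		for _, e := range exits {
			if dominates(e, p) {
				covers[e] = true
				found = true
			}
		}
		if !found {
			return nil, false, false
		}
	}
	if len(covers) != len(exits) {
		// Some exit does not flow into use: cases 3/4, unsupported here.
		return nil, false, false
	}
	return exits, true, true
}

func main() {
	entry := &Block{ID: 0}
	header := &Block{ID: 1, Idom: entry}
	e1 := &Block{ID: 2, Idom: header}
	e2 := &Block{ID: 3, Idom: header}
	use := &Block{ID: 4, Idom: header, Preds: []*Block{e1, e2}}

	sites, merge, ok := proxyPhiSites([]*Block{e1, e2}, use)
	fmt.Println(len(sites), merge, ok) // 2 true true
}
```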

Experimental data (from building the Go toolchain itself) shows that LCSSA can be applied to 98.27% of loops.

> how far are you into a performance evaluation for this? Experience in the past (I also tried this) was uncompelling on x86

x86 benchmark results:

goos: linux
goarch: amd64
pkg: archive/tar
cpu: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
                  │ old1023.log │           new1023-2.log            │
                  │   sec/op    │   sec/op     vs base               │
/Writer/USTAR-104   3.538µ ± 0%   3.796µ ± 2%  +7.29% (p=0.000 n=10)
/Writer/GNU-104     4.191µ ± 0%   4.571µ ± 0%  +9.07% (p=0.000 n=10)
/Writer/PAX-104     7.134µ ± 0%   7.837µ ± 0%  +9.85% (p=0.000 n=10)
/Reader/USTAR-104   2.958µ ± 0%   2.968µ ± 0%  +0.34% (p=0.005 n=10)
/Reader/GNU-104     1.721µ ± 0%   1.842µ ± 5%  +7.06% (p=0.000 n=10)
/Reader/PAX-104     6.280µ ± 0%   6.362µ ± 3%  +1.31% (p=0.000 n=10)
geomean             3.874µ        4.097µ       +5.76%

                  │ old1023.log  │             new1023-2.log             │
                  │     B/op     │     B/op      vs base                 │
/Writer/USTAR-104   1.164Ki ± 0%   1.164Ki ± 0%       ~ (p=1.000 n=10) ¹
/Writer/GNU-104     1.390Ki ± 0%   1.390Ki ± 0%       ~ (p=1.000 n=10) ¹
/Writer/PAX-104     2.188Ki ± 0%   2.188Ki ± 0%       ~ (p=1.000 n=10) ¹
/Reader/USTAR-104     985.0 ± 0%     985.0 ± 0%       ~ (p=0.628 n=10)
/Reader/GNU-104       977.0 ± 0%     977.0 ± 0%       ~ (p=1.000 n=10) ¹
/Reader/PAX-104     2.522Ki ± 0%   2.522Ki ± 0%       ~ (p=0.546 n=10)
geomean             1.420Ki        1.420Ki       +0.00%
¹ all samples are equal

                  │ old1023.log │            new1023-2.log            │
                  │  allocs/op  │ allocs/op   vs base                 │
/Writer/USTAR-104    25.00 ± 0%   25.00 ± 0%       ~ (p=1.000 n=10) ¹
/Writer/GNU-104      34.00 ± 0%   34.00 ± 0%       ~ (p=1.000 n=10) ¹
/Writer/PAX-104      57.00 ± 0%   57.00 ± 0%       ~ (p=1.000 n=10) ¹
/Reader/USTAR-104    15.00 ± 0%   15.00 ± 0%       ~ (p=1.000 n=10) ¹
/Reader/GNU-104      14.00 ± 0%   14.00 ± 0%       ~ (p=1.000 n=10) ¹
/Reader/PAX-104      32.00 ± 0%   32.00 ± 0%       ~ (p=1.000 n=10) ¹
geomean              26.23        26.23       +0.00%
¹ all samples are equal

pkg: archive/zip
                            │  old1023.log  │            new1023-2.log            │
                            │    sec/op     │    sec/op     vs base               │
CompressedZipGarbage-104      183.3µ ±   3%   177.9µ ±  0%  -2.97% (p=0.000 n=10)
Zip64Test-104                 10.53m ±   0%   10.36m ±  0%  -1.64% (p=0.000 n=10)
Zip64TestSizes/4096-104       9.242µ ± 247%   9.771µ ± 47%       ~ (p=0.143 n=10)
Zip64TestSizes/1048576-104    37.93µ ±   2%   36.94µ ±  1%  -2.60% (p=0.000 n=10)
Zip64TestSizes/67108864-104   414.5µ ±   3%   406.1µ ±  2%  -2.02% (p=0.019 n=10)
geomean                       194.8µ          193.3µ        -0.76%

                         │  old1023.log  │         new1023-2.log          │
                         │     B/op      │     B/op      vs base          │
CompressedZipGarbage-104   6.306Ki ± 60%   5.630Ki ± 9%  ~ (p=0.100 n=10)

                         │ old1023.log │         new1023-2.log          │
                         │  allocs/op  │ allocs/op   vs base            │
CompressedZipGarbage-104    43.00 ± 0%   43.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

pkg: container/heap
        │ old1023.log │           new1023-2.log            │
        │   sec/op    │   sec/op     vs base               │
Dup-104   251.0µ ± 0%   254.6µ ± 0%  +1.40% (p=0.000 n=10)

pkg: encoding/asn1
                           │ old1023.log │           new1023-2.log            │
                           │   sec/op    │   sec/op     vs base               │
ObjectIdentifierString-104   121.5n ± 0%   119.1n ± 2%  -1.98% (p=0.000 n=10)
Marshal-104                  27.72µ ± 0%   28.12µ ± 1%  +1.44% (p=0.000 n=10)
Unmarshal-104                5.603µ ± 0%   5.803µ ± 0%  +3.58% (p=0.000 n=10)
geomean                      2.662µ        2.688µ       +0.99%

              │ old1023.log  │             new1023-2.log             │
              │     B/op     │     B/op      vs base                 │
Marshal-104     8.070Ki ± 0%   8.070Ki ± 0%       ~ (p=1.000 n=10) ¹
Unmarshal-104     488.0 ± 0%     488.0 ± 0%       ~ (p=1.000 n=10) ¹
geomean         1.961Ki        1.961Ki       +0.00%
¹ all samples are equal

              │ old1023.log │            new1023-2.log            │
              │  allocs/op  │ allocs/op   vs base                 │
Marshal-104      363.0 ± 0%   363.0 ± 0%       ~ (p=1.000 n=10) ¹
Unmarshal-104    43.00 ± 0%   43.00 ± 0%       ~ (p=1.000 n=10) ¹
geomean          124.9        124.9       +0.00%
¹ all samples are equal

pkg: encoding/base32
                   │ old1023.log  │           new1023-2.log            │
                   │    sec/op    │   sec/op     vs base               │
Encode-104           9.383µ ±  0%   9.649µ ± 0%  +2.83% (p=0.000 n=10)
EncodeToString-104   14.05µ ± 10%   14.10µ ± 2%       ~ (p=0.739 n=10)
Decode-104           42.05µ ±  0%   45.02µ ± 0%  +7.06% (p=0.000 n=10)
DecodeString-104     41.54µ ±  0%   43.25µ ± 0%  +4.13% (p=0.000 n=10)
geomean              21.91µ         22.69µ       +3.58%

                   │ old1023.log  │            new1023-2.log            │
                   │     B/s      │     B/s       vs base               │
Encode-104           832.6Mi ± 0%   809.7Mi ± 0%  -2.74% (p=0.000 n=10)
EncodeToString-104   556.1Mi ± 9%   553.9Mi ± 2%       ~ (p=0.739 n=10)
Decode-104           297.4Mi ± 0%   277.7Mi ± 0%  -6.60% (p=0.000 n=10)
DecodeString-104     301.0Mi ± 0%   289.1Mi ± 0%  -3.96% (p=0.000 n=10)
geomean              451.2Mi        435.6Mi       -3.45%

pkg: encoding/base64
                      │ old1023.log  │           new1023-2.log            │
                      │    sec/op    │   sec/op     vs base               │
EncodeToString-104      13.69µ ± 17%   13.62µ ± 0%       ~ (p=0.052 n=10)
DecodeString/2-104      34.70n ±  0%   36.25n ± 1%  +4.45% (p=0.000 n=10)
DecodeString/4-104      39.12n ±  0%   38.33n ± 0%  -2.04% (p=0.001 n=10)
DecodeString/8-104      47.27n ±  3%   46.28n ± 0%  -2.09% (p=0.001 n=10)
DecodeString/64-104     133.9n ±  0%   131.5n ± 0%  -1.79% (p=0.000 n=10)
DecodeString/8192-104   11.98µ ±  0%   11.78µ ± 2%  -1.72% (p=0.012 n=10)
NewEncoding-104         168.3n ±  0%   172.5n ± 0%  +2.50% (p=0.000 n=10)
geomean                 303.5n         302.9n       -0.20%

                      │  old1023.log  │            new1023-2.log            │
                      │      B/s      │     B/s       vs base               │
EncodeToString-104      570.8Mi ± 14%   573.8Mi ± 0%       ~ (p=0.052 n=10)
DecodeString/2-104      109.9Mi ±  0%   105.2Mi ± 1%  -4.28% (p=0.000 n=10)
DecodeString/4-104      195.0Mi ±  0%   199.1Mi ± 0%  +2.08% (p=0.001 n=10)
DecodeString/8-104      242.1Mi ±  3%   247.3Mi ± 0%  +2.15% (p=0.001 n=10)
DecodeString/64-104     626.9Mi ±  0%   638.3Mi ± 0%  +1.81% (p=0.000 n=10)
DecodeString/8192-104   869.4Mi ±  0%   884.6Mi ± 2%  +1.75% (p=0.011 n=10)
NewEncoding-104         1.417Gi ±  0%   1.382Gi ± 0%  -2.44% (p=0.000 n=10)
geomean                 421.0Mi         421.8Mi       +0.20%

pkg: encoding/binary
                             │ old1023.log  │            new1023-2.log             │
                             │    sec/op    │    sec/op     vs base                │
ReadSlice1000Int32s-104         4.848µ ± 0%    4.853µ ± 0%        ~ (p=0.135 n=10)
ReadStruct-104                  381.9n ± 0%    381.2n ± 0%   -0.21% (p=0.001 n=10)
WriteStruct-104                 349.7n ± 0%    355.2n ± 1%   +1.57% (p=0.000 n=10)
ReadInts-104                    271.8n ± 0%    249.8n ± 0%   -8.09% (p=0.000 n=10)
WriteInts-104                   218.7n ± 0%    218.4n ± 1%        ~ (p=0.071 n=10)
WriteSlice1000Int32s-104        4.977µ ± 0%    4.673µ ± 0%   -6.11% (p=0.000 n=10)
PutUint16-104                  0.7312n ± 0%   0.6261n ± 0%  -14.36% (p=0.000 n=10)
AppendUint16-104                1.294n ± 5%    1.542n ± 0%  +19.20% (p=0.000 n=10)
PutUint32-104                  0.6262n ± 0%   0.6262n ± 0%        ~ (p=0.497 n=10)
AppendUint32-104                1.259n ± 0%    1.543n ± 0%  +22.60% (p=0.000 n=10)
PutUint64-104                  0.7826n ± 0%   0.7305n ± 0%   -6.66% (p=0.000 n=10)
AppendUint64-104                1.254n ± 0%    1.568n ± 0%  +25.04% (p=0.000 n=10)
LittleEndianPutUint16-104      0.5665n ± 1%   0.6262n ± 0%  +10.53% (p=0.000 n=10)
LittleEndianAppendUint16-104    1.324n ± 1%    1.354n ± 1%   +2.30% (p=0.000 n=10)
LittleEndianPutUint32-104      0.5575n ± 0%   0.6260n ± 0%  +12.30% (p=0.000 n=10)
LittleEndianAppendUint32-104    1.352n ± 6%    1.355n ± 1%        ~ (p=0.516 n=10)
LittleEndianPutUint64-104      0.5523n ± 0%   0.6260n ± 0%  +13.35% (p=0.000 n=10)
LittleEndianAppendUint64-104    1.329n ± 8%    1.318n ± 1%        ~ (p=0.617 n=10)
ReadFloats-104                  71.08n ± 0%    66.14n ± 0%   -6.95% (p=0.000 n=10)
WriteFloats-104                 57.54n ± 0%    58.08n ± 0%   +0.95% (p=0.000 n=10)
ReadSlice1000Float32s-104       4.853µ ± 0%    5.005µ ± 0%   +3.15% (p=0.000 n=10)
WriteSlice1000Float32s-104      5.131µ ± 0%    4.822µ ± 0%   -6.00% (p=0.000 n=10)
ReadSlice1000Uint8s-104         291.6n ± 1%    291.8n ± 0%        ~ (p=0.697 n=10)
WriteSlice1000Uint8s-104        286.6n ± 0%    285.1n ± 0%   -0.54% (p=0.014 n=10)
PutUvarint32-104                16.89n ± 1%    15.50n ± 0%   -8.23% (p=0.000 n=10)
PutUvarint64-104                52.83n ± 1%    47.58n ± 0%   -9.94% (p=0.000 n=10)
geomean                         23.49n         23.78n        +1.21%

                             │ old1023.log  │             new1023-2.log             │
                             │     B/s      │      B/s       vs base                │
ReadSlice1000Int32s-104        786.8Mi ± 0%    786.1Mi ± 0%        ~ (p=0.089 n=10)
ReadStruct-104                 187.3Mi ± 0%    187.7Mi ± 0%   +0.22% (p=0.001 n=10)
WriteStruct-104                204.5Mi ± 0%    201.4Mi ± 1%   -1.54% (p=0.000 n=10)
ReadInts-104                   105.3Mi ± 0%    114.5Mi ± 0%   +8.82% (p=0.000 n=10)
WriteInts-104                  130.8Mi ± 0%    131.0Mi ± 1%        ~ (p=0.075 n=10)
WriteSlice1000Int32s-104       766.4Mi ± 0%    816.4Mi ± 0%   +6.51% (p=0.000 n=10)
PutUint16-104                  2.547Gi ± 0%    2.975Gi ± 0%  +16.77% (p=0.000 n=10)
AppendUint16-104               1.439Gi ± 5%    1.207Gi ± 0%  -16.11% (p=0.000 n=10)
PutUint32-104                  5.949Gi ± 0%    5.949Gi ± 0%        ~ (p=0.481 n=10)
AppendUint32-104               2.959Gi ± 0%    2.414Gi ± 0%  -18.43% (p=0.000 n=10)
PutUint64-104                  9.520Gi ± 0%   10.199Gi ± 0%   +7.13% (p=0.000 n=10)
AppendUint64-104               5.939Gi ± 0%    4.752Gi ± 0%  -19.99% (p=0.000 n=10)
LittleEndianPutUint16-104      3.288Gi ± 1%    2.975Gi ± 0%   -9.52% (p=0.000 n=10)
LittleEndianAppendUint16-104   1.407Gi ± 1%    1.376Gi ± 1%   -2.26% (p=0.000 n=10)
LittleEndianPutUint32-104      6.682Gi ± 0%    5.950Gi ± 0%  -10.95% (p=0.000 n=10)
LittleEndianAppendUint32-104   2.757Gi ± 6%    2.751Gi ± 1%        ~ (p=0.529 n=10)
LittleEndianPutUint64-104      13.49Gi ± 0%    11.90Gi ± 0%  -11.78% (p=0.000 n=10)
LittleEndianAppendUint64-104   5.604Gi ± 7%    5.655Gi ± 1%        ~ (p=0.631 n=10)
ReadFloats-104                 161.0Mi ± 0%    173.0Mi ± 0%   +7.47% (p=0.000 n=10)
WriteFloats-104                198.9Mi ± 0%    197.0Mi ± 0%   -0.95% (p=0.000 n=10)
ReadSlice1000Float32s-104      786.1Mi ± 0%    762.1Mi ± 0%   -3.05% (p=0.000 n=10)
WriteSlice1000Float32s-104     743.5Mi ± 0%    791.1Mi ± 0%   +6.39% (p=0.000 n=10)
ReadSlice1000Uint8s-104        3.194Gi ± 1%    3.192Gi ± 0%        ~ (p=0.684 n=10)
WriteSlice1000Uint8s-104       3.249Gi ± 0%    3.266Gi ± 0%   +0.53% (p=0.019 n=10)
PutUvarint32-104               225.9Mi ± 1%    246.1Mi ± 0%   +8.94% (p=0.000 n=10)
PutUvarint64-104               144.4Mi ± 1%    160.3Mi ± 0%  +11.02% (p=0.000 n=10)
geomean                        1.147Gi         1.134Gi        -1.20%

pkg: encoding/csv
                                          │ old1023.log │            new1023-2.log            │
                                          │   sec/op    │    sec/op     vs base               │
Read-104                                    2.054µ ± 1%   2.082µ ±  0%  +1.36% (p=0.017 n=10)
ReadWithFieldsPerRecord-104                 2.052µ ± 0%   2.074µ ±  1%  +1.05% (p=0.003 n=10)
ReadWithoutFieldsPerRecord-104              2.053µ ± 0%   2.067µ ±  2%       ~ (p=0.142 n=10)
ReadLargeFields-104                         3.671µ ± 3%   3.708µ ±  0%       ~ (p=0.224 n=10)
ReadReuseRecord-104                         1.231µ ± 0%   1.200µ ±  0%  -2.52% (p=0.000 n=10)
ReadReuseRecordWithFieldsPerRecord-104      1.232µ ± 0%   1.200µ ±  0%  -2.60% (p=0.000 n=10)
ReadReuseRecordWithoutFieldsPerRecord-104   1.233µ ± 0%   1.202µ ±  0%  -2.55% (p=0.000 n=10)
ReadReuseRecordLargeFields-104              2.678µ ± 1%   2.707µ ±  1%  +1.10% (p=0.000 n=10)
Write-104                                   1.661µ ± 0%   1.734µ ± 24%  +4.36% (p=0.004 n=10)
geomean                                     1.858µ        1.862µ        +0.19%

                                          │ old1023.log  │             new1023-2.log             │
                                          │     B/op     │     B/op      vs base                 │
Read-104                                      664.0 ± 0%     664.0 ± 0%       ~ (p=1.000 n=10) ¹
ReadWithFieldsPerRecord-104                   664.0 ± 0%     664.0 ± 0%       ~ (p=1.000 n=10) ¹
ReadWithoutFieldsPerRecord-104                664.0 ± 0%     664.0 ± 0%       ~ (p=1.000 n=10) ¹
ReadLargeFields-104                         3.844Ki ± 0%   3.844Ki ± 0%       ~ (p=1.000 n=10) ¹
ReadReuseRecord-104                           24.00 ± 0%     24.00 ± 0%       ~ (p=1.000 n=10) ¹
ReadReuseRecordWithFieldsPerRecord-104        24.00 ± 0%     24.00 ± 0%       ~ (p=1.000 n=10) ¹
ReadReuseRecordWithoutFieldsPerRecord-104     24.00 ± 0%     24.00 ± 0%       ~ (p=1.000 n=10) ¹
ReadReuseRecordLargeFields-104              2.906Ki ± 0%   2.906Ki ± 0%       ~ (p=1.000 n=10) ¹
geomean                                       288.1          288.1       +0.00%
¹ all samples are equal

                                          │ old1023.log │            new1023-2.log            │
                                          │  allocs/op  │ allocs/op   vs base                 │
Read-104                                     16.00 ± 0%   16.00 ± 0%       ~ (p=1.000 n=10) ¹
ReadWithFieldsPerRecord-104                  16.00 ± 0%   16.00 ± 0%       ~ (p=1.000 n=10) ¹
ReadWithoutFieldsPerRecord-104               16.00 ± 0%   16.00 ± 0%       ~ (p=1.000 n=10) ¹
ReadLargeFields-104                          24.00 ± 0%   24.00 ± 0%       ~ (p=1.000 n=10) ¹
ReadReuseRecord-104                          6.000 ± 0%   6.000 ± 0%       ~ (p=1.000 n=10) ¹
ReadReuseRecordWithFieldsPerRecord-104       6.000 ± 0%   6.000 ± 0%       ~ (p=1.000 n=10) ¹
ReadReuseRecordWithoutFieldsPerRecord-104    6.000 ± 0%   6.000 ± 0%       ~ (p=1.000 n=10) ¹
ReadReuseRecordLargeFields-104               12.00 ± 0%   12.00 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                      11.24        11.24       +0.00%
¹ all samples are equal

pkg: encoding/gob
                            │ old1023.log  │            new1023-2.log            │
                            │    sec/op    │    sec/op     vs base               │
EndToEndPipe-104              712.0n ±  6%   710.0n ± 15%       ~ (p=0.631 n=10)
EndToEndByteBuffer-104        620.9n ±  2%   610.0n ±  1%  -1.76% (p=0.014 n=10)
EndToEndSliceByteBuffer-104   8.193µ ±  1%   8.306µ ±  1%  +1.39% (p=0.001 n=10)
EncodeComplex128Slice-104     312.4n ±  2%   311.3n ±  1%       ~ (p=0.190 n=10)
EncodeFloat64Slice-104        139.1n ±  1%   140.0n ±  1%  +0.72% (p=0.003 n=10)
EncodeInt32Slice-104          154.8n ±  1%   156.0n ±  1%  +0.81% (p=0.034 n=10)
EncodeStringSlice-104         148.6n ±  1%   144.8n ±  2%  -2.52% (p=0.000 n=10)
EncodeInterfaceSlice-104      7.485µ ±  2%   7.476µ ±  4%       ~ (p=0.839 n=10)
DecodeComplex128Slice-104     13.14µ ± 50%   12.96µ ± 51%       ~ (p=1.000 n=10)
DecodeFloat64Slice-104        5.428µ ±  1%   5.559µ ±  1%  +2.42% (p=0.000 n=10)
DecodeInt32Slice-104          4.737µ ±  2%   4.799µ ±  2%  +1.32% (p=0.037 n=10)
DecodeStringSlice-104         15.21µ ±  1%   15.25µ ±  1%       ~ (p=0.853 n=10)
DecodeStringsSlice-104        27.67µ ±  0%   27.87µ ±  1%  +0.72% (p=0.000 n=10)
DecodeBytesSlice-104          7.045µ ±  1%   7.042µ ±  1%       ~ (p=0.971 n=10)
DecodeInterfaceSlice-104      43.37µ ±  1%   44.27µ ±  0%  +2.09% (p=0.000 n=10)
DecodeMap-104                 149.1µ ±  1%   150.8µ ±  1%  +1.16% (p=0.005 n=10)
geomean                       3.275µ         3.284µ        +0.27%

                            │   old1023.log   │             new1023-2.log              │
                            │      B/op       │     B/op       vs base                 │
EndToEndPipe-104              1.767Ki ±  0%     1.767Ki ±  0%       ~ (p=1.000 n=10) ¹
EndToEndByteBuffer-104        1.766Ki ±  0%     1.766Ki ±  0%       ~ (p=1.000 n=10) ¹
EndToEndSliceByteBuffer-104   13.47Ki ±  0%     13.47Ki ±  0%  +0.01% (p=0.031 n=10)
EncodeComplex128Slice-104       3.000 ± 33%       2.000 ± 50%       ~ (p=0.170 n=10)
EncodeFloat64Slice-104          0.000 ±  0%       0.000 ±  0%       ~ (p=1.000 n=10) ¹
EncodeInt32Slice-104            0.000 ±  0%       0.000 ±  0%       ~ (p=1.000 n=10) ¹
EncodeStringSlice-104           1.000 ±  0%       1.000 ±  0%       ~ (p=1.000 n=10) ¹
EncodeInterfaceSlice-104        92.00 ±  3%       91.50 ±  3%       ~ (p=0.402 n=10)
DecodeComplex128Slice-104     24.38Ki ±  0%     24.38Ki ±  0%       ~ (p=0.467 n=10)
DecodeFloat64Slice-104        10.37Ki ±  0%     10.37Ki ±  0%       ~ (p=1.000 n=10) ¹
DecodeInt32Slice-104          9.347Ki ±  0%     9.347Ki ±  0%       ~ (p=1.000 n=10)
DecodeStringSlice-104         38.01Ki ±  0%     38.01Ki ±  0%       ~ (p=1.000 n=10)
DecodeStringsSlice-104        63.91Ki ±  0%     63.91Ki ±  0%       ~ (p=0.142 n=10)
DecodeBytesSlice-104          22.44Ki ±  0%     22.44Ki ±  0%       ~ (p=0.430 n=10)
DecodeInterfaceSlice-104      80.30Ki ±  0%     80.30Ki ±  0%       ~ (p=0.051 n=10)
DecodeMap-104                 52.67Ki ±  0%     52.67Ki ±  0%       ~ (p=0.973 n=10)
geomean                                     ²                  -2.53%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                            │  old1023.log  │            new1023-2.log             │
                            │   allocs/op   │  allocs/op   vs base                 │
EndToEndPipe-104               2.000 ± 0%      2.000 ± 0%       ~ (p=1.000 n=10) ¹
EndToEndByteBuffer-104         2.000 ± 0%      2.000 ± 0%       ~ (p=1.000 n=10) ¹
EndToEndSliceByteBuffer-104    302.0 ± 0%      302.0 ± 0%       ~ (p=1.000 n=10) ¹
EncodeComplex128Slice-104      0.000 ± 0%      0.000 ± 0%       ~ (p=1.000 n=10) ¹
EncodeFloat64Slice-104         0.000 ± 0%      0.000 ± 0%       ~ (p=1.000 n=10) ¹
EncodeInt32Slice-104           0.000 ± 0%      0.000 ± 0%       ~ (p=1.000 n=10) ¹
EncodeStringSlice-104          0.000 ± 0%      0.000 ± 0%       ~ (p=1.000 n=10) ¹
EncodeInterfaceSlice-104       0.000 ± 0%      0.000 ± 0%       ~ (p=1.000 n=10) ¹
DecodeComplex128Slice-104      169.0 ± 0%      169.0 ± 0%       ~ (p=1.000 n=10) ¹
DecodeFloat64Slice-104         170.0 ± 0%      170.0 ± 0%       ~ (p=1.000 n=10) ¹
DecodeInt32Slice-104           169.0 ± 0%      169.0 ± 0%       ~ (p=1.000 n=10) ¹
DecodeStringSlice-104         1.170k ± 0%     1.170k ± 0%       ~ (p=1.000 n=10) ¹
DecodeStringsSlice-104        2.182k ± 0%     2.183k ± 0%       ~ (p=0.141 n=10)
DecodeBytesSlice-104           171.0 ± 0%      171.0 ± 0%       ~ (p=1.000 n=10)
DecodeInterfaceSlice-104      3.178k ± 0%     3.178k ± 0%       ~ (p=1.000 n=10) ¹
DecodeMap-104                  181.0 ± 0%      181.0 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                   ²                +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

pkg: encoding/hex
                 │ old1023.log │            new1023-2.log            │
                 │   sec/op    │   sec/op     vs base                │
Encode/256-104     330.9n ± 0%   330.4n ± 0%        ~ (p=0.233 n=10)
Encode/1024-104    1.293µ ± 0%   1.292µ ± 0%   -0.08% (p=0.000 n=10)
Encode/4096-104    5.140µ ± 0%   5.141µ ± 0%        ~ (p=0.524 n=10)
Encode/16384-104   20.54µ ± 0%   20.54µ ± 0%   +0.03% (p=0.024 n=10)
Decode/256-104     190.9n ± 0%   211.8n ± 0%  +10.95% (p=0.000 n=10)
Decode/1024-104    730.4n ± 0%   812.9n ± 0%  +11.29% (p=0.000 n=10)
Decode/4096-104    2.896µ ± 0%   3.218µ ± 0%  +11.12% (p=0.000 n=10)
Decode/16384-104   11.56µ ± 0%   12.84µ ± 0%  +11.08% (p=0.000 n=10)
Dump/256-104       3.692µ ± 0%   3.580µ ± 5%        ~ (p=0.468 n=10)
Dump/1024-104      14.27µ ± 0%   13.84µ ± 1%   -3.00% (p=0.000 n=10)
Dump/4096-104      55.67µ ± 0%   54.13µ ± 0%   -2.76% (p=0.000 n=10)
Dump/16384-104     220.9µ ± 0%   215.1µ ± 0%   -2.65% (p=0.000 n=10)
geomean            4.764µ        4.886µ        +2.56%

                 │ old1023.log  │            new1023-2.log             │
                 │     B/s      │     B/s       vs base                │
Encode/256-104     737.8Mi ± 0%   739.0Mi ± 0%        ~ (p=0.271 n=10)
Encode/1024-104    755.3Mi ± 0%   755.8Mi ± 0%   +0.06% (p=0.000 n=10)
Encode/4096-104    759.9Mi ± 0%   759.9Mi ± 0%        ~ (p=0.137 n=10)
Encode/16384-104   760.8Mi ± 0%   760.6Mi ± 0%   -0.03% (p=0.037 n=10)
Decode/256-104     1.249Gi ± 0%   1.126Gi ± 0%   -9.87% (p=0.000 n=10)
Decode/1024-104    1.306Gi ± 0%   1.173Gi ± 0%  -10.15% (p=0.000 n=10)
Decode/4096-104    1.317Gi ± 0%   1.185Gi ± 0%  -10.00% (p=0.000 n=10)
Decode/16384-104   1.320Gi ± 0%   1.188Gi ± 0%   -9.97% (p=0.000 n=10)
Dump/256-104       66.13Mi ± 0%   68.21Mi ± 4%        ~ (p=0.469 n=10)
Dump/1024-104      68.43Mi ± 0%   70.55Mi ± 1%   +3.09% (p=0.000 n=10)
Dump/4096-104      70.17Mi ± 0%   72.16Mi ± 0%   +2.84% (p=0.000 n=10)
Dump/16384-104     70.72Mi ± 0%   72.64Mi ± 0%   +2.71% (p=0.000 n=10)
geomean            410.0Mi        399.8Mi        -2.50%

pkg: encoding/json
                                     │  old1023.log  │             new1023-2.log             │
                                     │    sec/op     │    sec/op      vs base                │
CodeEncoder-104                         142.9µ ±  1%    153.9µ ±  3%   +7.69% (p=0.000 n=10)
CodeEncoderError-104                    148.6µ ±  0%    158.2µ ±  3%   +6.44% (p=0.000 n=10)
CodeMarshal-104                         172.6µ ±  3%    175.7µ ±  4%        ~ (p=0.190 n=10)
CodeMarshalError-104                    193.2µ ±  1%    198.9µ ±  2%   +2.94% (p=0.000 n=10)
MarshalBytes/32-104                     227.2n ±  1%    223.8n ±  0%   -1.50% (p=0.000 n=10)
MarshalBytes/256-104                    612.1n ±  1%    607.9n ±  1%        ~ (p=0.165 n=10)
MarshalBytes/4096-104                   7.116µ ±  1%    7.130µ ±  1%        ~ (p=0.926 n=10)
MarshalBytesError/32-104                209.1µ ±  3%    205.7µ ±  2%        ~ (p=0.143 n=10)
MarshalBytesError/256-104               205.7µ ±  2%    203.6µ ±  1%        ~ (p=0.063 n=10)
MarshalBytesError/4096-104              216.0µ ±  1%    211.2µ ±  0%   -2.25% (p=0.000 n=10)
MarshalMap-104                          116.9n ±  1%    120.2n ±  1%   +2.82% (p=0.000 n=10)
CodeDecoder-104                         627.8µ ±  2%    641.1µ ±  1%   +2.11% (p=0.004 n=10)
UnicodeDecoder-104                      270.5n ±  0%    270.1n ±  0%        ~ (p=0.085 n=10)
DecoderStream-104                       202.5n ±  0%    201.9n ±  0%   -0.30% (p=0.001 n=10)
CodeUnmarshal-104                       962.8µ ±  2%    995.6µ ±  2%   +3.41% (p=0.002 n=10)
CodeUnmarshalReuse-104                  696.3µ ±  1%    715.5µ ±  2%   +2.76% (p=0.002 n=10)
UnmarshalString-104                     60.68n ±  0%    61.20n ±  0%   +0.86% (p=0.000 n=10)
UnmarshalFloat64-104                    54.83n ±  0%    56.03n ±  1%   +2.19% (p=0.000 n=10)
UnmarshalInt64-104                      53.90n ±  1%    54.58n ±  1%   +1.25% (p=0.002 n=10)
UnmarshalMap-104                        132.8n ±  0%    133.6n ±  0%   +0.64% (p=0.000 n=10)
Issue10335-104                          75.52n ±  1%    76.77n ±  0%   +1.66% (p=0.000 n=10)
Issue34127-104                          19.28n ±  2%    18.84n ±  1%   -2.26% (p=0.001 n=10)
Unmapped-104                            114.5n ±  1%    111.8n ±  1%   -2.44% (p=0.000 n=10)
TypeFieldsCache/MissTypes1-104          95.55µ ± 20%    92.57µ ± 14%        ~ (p=0.579 n=10)
TypeFieldsCache/MissTypes10-104         94.53µ ±  7%    97.79µ ±  3%   +3.45% (p=0.009 n=10)
TypeFieldsCache/MissTypes100-104        444.6µ ±  1%    447.3µ ±  2%        ~ (p=0.218 n=10)
TypeFieldsCache/MissTypes1000-104       2.501m ±  2%    2.519m ±  1%        ~ (p=0.481 n=10)
TypeFieldsCache/MissTypes10000-104      22.35m ±  3%    22.05m ±  3%        ~ (p=0.105 n=10)
TypeFieldsCache/MissTypes100000-104     225.4m ±  3%    225.3m ±  3%        ~ (p=0.631 n=10)
TypeFieldsCache/MissTypes1000000-104     2.283 ±  8%     2.263 ±  6%        ~ (p=0.315 n=10)
TypeFieldsCache/HitTypes1-104          0.4607n ±  0%   0.4871n ±  1%   +5.74% (p=0.000 n=10)
TypeFieldsCache/HitTypes10-104         0.4612n ±  0%   0.4625n ±  1%   +0.27% (p=0.001 n=10)
TypeFieldsCache/HitTypes100-104        0.4645n ±  4%   0.4697n ±  1%        ~ (p=0.128 n=10)
TypeFieldsCache/HitTypes1000-104       0.4722n ±  2%   0.4755n ±  0%        ~ (p=0.066 n=10)
TypeFieldsCache/HitTypes10000-104      0.4675n ±  3%   0.4657n ±  1%        ~ (p=0.383 n=10)
TypeFieldsCache/HitTypes100000-104     0.4668n ±  3%   0.4683n ±  0%        ~ (p=0.383 n=10)
TypeFieldsCache/HitTypes1000000-104    0.4616n ±  0%   0.4623n ±  0%   +0.15% (p=0.014 n=10)
EncodeMarshaler-104                     10.07n ±  4%    10.04n ±  7%        ~ (p=0.955 n=10)
EncoderEncode-104                       7.961n ±  4%    8.260n ± 50%        ~ (p=0.739 n=10)
NumberIsValid-104                       18.18n ±  0%    21.74n ±  0%  +19.58% (p=0.000 n=10)
NumberIsValidRegexp-104                 360.5n ±  1%    368.9n ±  1%   +2.33% (p=0.000 n=10)
geomean                                 1.783µ          1.807µ         +1.33%

                       │    old1023.log     │              new1023-2.log               │
                       │        B/op        │      B/op       vs base                  │
CodeEncoder-104            6.500 ± 29838%     794.000 ± 479%        ~ (p=0.085 n=10)
CodeEncoderError-104     3.159Ki ±   113%     1.184Ki ± 266%        ~ (p=0.646 n=10)
CodeMarshal-104          1.885Mi ±     2%     1.866Mi ±   2%        ~ (p=0.631 n=10)
CodeMarshalError-104     1.933Mi ±     2%     1.926Mi ±   2%        ~ (p=0.280 n=10)
MarshalMap-104             232.0 ±     0%       232.0 ±   0%        ~ (p=1.000 n=10) ¹
CodeDecoder-104          1.099Mi ±     1%     1.118Mi ±   2%   +1.73% (p=0.022 n=10)
UnicodeDecoder-104         28.00 ±     0%       28.00 ±   0%        ~ (p=1.000 n=10) ¹
DecoderStream-104          8.000 ±     0%       8.000 ±   0%        ~ (p=1.000 n=10) ¹
CodeUnmarshal-104        1.918Mi ±     0%     1.918Mi ±   0%        ~ (p=0.169 n=10)
CodeUnmarshalReuse-104   776.2Ki ±     0%     777.5Ki ±   0%        ~ (p=0.190 n=10)
UnmarshalString-104        160.0 ±     0%       160.0 ±   0%        ~ (p=1.000 n=10) ¹
UnmarshalFloat64-104       144.0 ±     0%       144.0 ±   0%        ~ (p=1.000 n=10) ¹
UnmarshalInt64-104         144.0 ±     0%       144.0 ±   0%        ~ (p=1.000 n=10) ¹
UnmarshalMap-104           256.0 ±     0%       256.0 ±   0%        ~ (p=1.000 n=10) ¹
Issue10335-104             168.0 ±     0%       168.0 ±   0%        ~ (p=1.000 n=10) ¹
Issue34127-104             32.00 ±     0%       32.00 ±   0%        ~ (p=1.000 n=10) ¹
Unmapped-104               200.0 ±     0%       200.0 ±   0%        ~ (p=1.000 n=10) ¹
EncodeMarshaler-104        4.000 ±     0%       4.000 ±   0%        ~ (p=1.000 n=10) ¹
EncoderEncode-104          0.000 ±     0%       0.000 ±   0%        ~ (p=1.000 n=10) ¹
geomean                                   ²                   +22.32%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                       │  old1023.log  │            new1023-2.log             │
                       │   allocs/op   │  allocs/op   vs base                 │
CodeEncoder-104           0.000 ± 0%      0.000 ± 0%       ~ (p=1.000 n=10) ¹
CodeEncoderError-104      4.000 ± 0%      4.000 ± 0%       ~ (p=1.000 n=10) ¹
CodeMarshal-104           1.000 ± 0%      1.000 ± 0%       ~ (p=1.000 n=10) ¹
CodeMarshalError-104      7.000 ± 0%      7.000 ± 0%       ~ (p=1.000 n=10) ¹
MarshalMap-104            8.000 ± 0%      8.000 ± 0%       ~ (p=1.000 n=10) ¹
CodeDecoder-104          25.83k ± 0%     25.87k ± 0%  +0.16% (p=0.019 n=10)
UnicodeDecoder-104        2.000 ± 0%      2.000 ± 0%       ~ (p=1.000 n=10) ¹
DecoderStream-104         1.000 ± 0%      1.000 ± 0%       ~ (p=1.000 n=10) ¹
CodeUnmarshal-104        39.99k ± 0%     39.99k ± 0%       ~ (p=1.000 n=10) ¹
CodeUnmarshalReuse-104   25.97k ± 0%     25.98k ± 0%       ~ (p=0.172 n=10)
UnmarshalString-104       2.000 ± 0%      2.000 ± 0%       ~ (p=1.000 n=10) ¹
UnmarshalFloat64-104      1.000 ± 0%      1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnmarshalInt64-104        1.000 ± 0%      1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnmarshalMap-104          12.00 ± 0%      12.00 ± 0%       ~ (p=1.000 n=10) ¹
Issue10335-104            3.000 ± 0%      3.000 ± 0%       ~ (p=1.000 n=10) ¹
Issue34127-104            2.000 ± 0%      2.000 ± 0%       ~ (p=1.000 n=10) ¹
Unmapped-104              4.000 ± 0%      4.000 ± 0%       ~ (p=1.000 n=10) ¹
EncodeMarshaler-104       1.000 ± 0%      1.000 ± 0%       ~ (p=1.000 n=10) ¹
EncoderEncode-104         0.000 ± 0%      0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                              ²                +0.01%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                       │ old1023.log  │            new1023-2.log            │
                       │     B/s      │     B/s       vs base               │
CodeEncoder-104          12.65Gi ± 1%   11.75Gi ± 3%  -7.14% (p=0.000 n=10)
CodeEncoderError-104     12.16Gi ± 0%   11.42Gi ± 3%  -6.05% (p=0.000 n=10)
CodeMarshal-104          10.47Gi ± 3%   10.29Gi ± 4%       ~ (p=0.190 n=10)
CodeMarshalError-104     9.352Gi ± 1%   9.085Gi ± 2%  -2.85% (p=0.000 n=10)
CodeDecoder-104          2.879Gi ± 2%   2.819Gi ± 1%  -2.07% (p=0.004 n=10)
UnicodeDecoder-104       49.36Mi ± 0%   49.44Mi ± 0%       ~ (p=0.085 n=10)
CodeUnmarshal-104        1.877Gi ± 2%   1.815Gi ± 2%  -3.30% (p=0.002 n=10)
CodeUnmarshalReuse-104   2.596Gi ± 1%   2.526Gi ± 2%  -2.69% (p=0.002 n=10)
geomean                  3.169Gi        3.067Gi       -3.24%

pkg: encoding/pem
           │ old1023.log │           new1023-2.log            │
           │   sec/op    │   sec/op     vs base               │
Encode-104   92.85µ ± 0%   92.81µ ± 0%       ~ (p=0.971 n=10)
Decode-104   210.4µ ± 1%   206.4µ ± 1%  -1.90% (p=0.000 n=10)
geomean      139.8µ        138.4µ       -0.98%

           │ old1023.log  │            new1023-2.log            │
           │     B/s      │     B/s       vs base               │
Encode-104   673.2Mi ± 0%   673.4Mi ± 0%       ~ (p=0.954 n=10)
Decode-104   402.5Mi ± 1%   410.3Mi ± 1%  +1.94% (p=0.000 n=10)
geomean      520.5Mi        525.6Mi       +0.99%

pkg: encoding/xml
                  │ old1023.log  │            new1023-2.log            │
                  │    sec/op    │    sec/op     vs base               │
Marshal-104         2.976µ ± 13%   3.027µ ± 12%       ~ (p=0.404 n=10)
Unmarshal-104       5.819µ ±  4%   5.774µ ±  2%       ~ (p=0.912 n=10)
HTMLAutoClose-104   1.946µ ±  1%   1.971µ ±  1%  +1.28% (p=0.010 n=10)
geomean             3.230µ         3.254µ        +0.74%

              │ old1023.log  │             new1023-2.log             │
              │     B/op     │     B/op      vs base                 │
Marshal-104     5.570Ki ± 0%   5.570Ki ± 0%       ~ (p=1.000 n=10) ¹
Unmarshal-104   7.587Ki ± 0%   7.587Ki ± 0%       ~ (p=1.000 n=10) ¹
geomean         6.501Ki        6.501Ki       +0.00%
¹ all samples are equal

              │ old1023.log │            new1023-2.log            │
              │  allocs/op  │ allocs/op   vs base                 │
Marshal-104      23.00 ± 0%   23.00 ± 0%       ~ (p=1.000 n=10) ¹
Unmarshal-104    185.0 ± 0%   185.0 ± 0%       ~ (p=1.000 n=10) ¹
geomean          65.23        65.23       +0.00%
¹ all samples are equal

pkg: regexp
                                 │  old1023.log  │             new1023-2.log             │
                                 │    sec/op     │    sec/op      vs base                │
Find-104                            210.3n ±  7%    217.8n ±  0%        ~ (p=0.137 n=10)
FindAllNoMatches-104               102.10n ±  5%    99.88n ±  2%   -2.17% (p=0.000 n=10)
FindString-104                      214.2n ±  0%    218.3n ±  3%   +1.89% (p=0.000 n=10)
FindSubmatch-104                    292.4n ±  1%    261.8n ±  1%  -10.48% (p=0.000 n=10)
FindStringSubmatch-104              286.7n ±  1%    257.6n ±  3%  -10.15% (p=0.000 n=10)
Literal-104                         63.62n ±  0%    64.32n ±  2%   +1.10% (p=0.000 n=10)
NotLiteral-104                      1.126µ ±  0%    1.135µ ±  0%   +0.84% (p=0.000 n=10)
MatchClass-104                      1.577µ ±  2%    1.638µ ±  0%   +3.90% (p=0.000 n=10)
MatchClass_InRange-104              1.522µ ±  1%    1.572µ ±  0%   +3.25% (p=0.000 n=10)
ReplaceAll-104                      1.131µ ±  0%    1.127µ ±  0%   -0.40% (p=0.000 n=10)
AnchoredLiteralShortNonMatch-104    53.31n ± 10%    52.47n ±  0%   -1.57% (p=0.000 n=10)
AnchoredLiteralLongNonMatch-104     67.59n ±  5%    69.73n ±  1%   +3.18% (p=0.022 n=10)
AnchoredShortMatch-104              88.05n ±  1%    86.80n ±  2%        ~ (p=0.085 n=10)
AnchoredLongMatch-104               202.3n ±  0%    202.2n ±  0%        ~ (p=0.534 n=10)
OnePassShortA-104                   367.1n ±  0%    380.1n ±  0%   +3.53% (p=0.000 n=10)
NotOnePassShortA-104                410.4n ±  0%    387.9n ±  0%   -5.48% (p=0.000 n=10)
OnePassShortB-104                   299.2n ±  0%    314.7n ±  0%   +5.18% (p=0.000 n=10)
NotOnePassShortB-104                296.5n ±  2%    282.4n ±  1%   -4.76% (p=0.000 n=10)
OnePassLongPrefix-104               67.45n ±  0%    67.83n ±  1%        ~ (p=0.137 n=10)
OnePassLongNotPrefix-104            220.8n ±  0%    243.1n ±  1%  +10.05% (p=0.000 n=10)
MatchParallelShared-104             7.143n ±  8%    3.423n ±  1%  -52.08% (p=0.000 n=10)
MatchParallelCopied-104             6.530n ± 17%    3.495n ±  1%  -46.48% (p=0.000 n=10)
QuoteMetaAll-104                    112.8n ±  1%    111.9n ±  1%   -0.84% (p=0.001 n=10)
QuoteMetaNone-104                   51.39n ±  1%    45.69n ±  2%  -11.09% (p=0.000 n=10)
Compile/Onepass-104                 5.435µ ± 10%    5.800µ ±  5%   +6.72% (p=0.043 n=10)
Compile/Medium-104                  13.54µ ± 13%    13.86µ ±  1%   +2.39% (p=0.011 n=10)
Compile/Hard-104                    115.7µ ±  2%    113.4µ ± 12%        ~ (p=0.684 n=10)
Match/Easy0/16-104                  4.385n ±  0%    4.383n ±  0%   -0.05% (p=0.001 n=10)
Match/Easy0/32-104                  49.21n ±  3%    46.01n ±  4%   -6.51% (p=0.000 n=10)
Match/Easy0/1K-104                  258.2n ±  0%    256.9n ±  0%   -0.50% (p=0.000 n=10)
Match/Easy0/32K-104                 4.376µ ±  1%    4.481µ ±  1%   +2.40% (p=0.000 n=10)
Match/Easy0/1M-104                  259.1µ ±  0%    251.6µ ±  0%   -2.90% (p=0.000 n=10)
Match/Easy0/32M-104                 8.563m ±  1%    8.396m ±  1%   -1.95% (p=0.000 n=10)
Match/Easy0i/16-104                 4.384n ±  0%    4.385n ±  0%        ~ (p=0.558 n=10)
Match/Easy0i/32-104                 799.4n ±  0%    766.0n ±  0%   -4.17% (p=0.000 n=10)
Match/Easy0i/1K-104                 23.49µ ±  0%    22.31µ ±  0%   -5.00% (p=0.000 n=10)
Match/Easy0i/32K-104                949.9µ ±  1%    901.6µ ±  1%   -5.08% (p=0.000 n=10)
Match/Easy0i/1M-104                 30.34m ±  1%    29.03m ±  1%   -4.30% (p=0.000 n=10)
Match/Easy0i/32M-104                968.3m ±  0%    932.8m ±  1%   -3.66% (p=0.000 n=10)
Match/Easy1/16-104                  4.386n ±  0%    4.384n ±  0%        ~ (p=0.073 n=10)
Match/Easy1/32-104                  42.73n ±  0%    41.78n ±  3%        ~ (p=0.085 n=10)
Match/Easy1/1K-104                  618.9n ±  0%    614.3n ±  0%   -0.73% (p=0.000 n=10)
Match/Easy1/32K-104                 28.63µ ±  2%    28.99µ ±  1%        ~ (p=0.052 n=10)
Match/Easy1/1M-104                  997.9µ ±  2%   1015.1µ ±  0%   +1.72% (p=0.029 n=10)
Match/Easy1/32M-104                 31.82m ±  2%    32.50m ±  0%   +2.16% (p=0.000 n=10)
Match/Medium/16-104                 4.385n ±  0%    4.386n ±  0%        ~ (p=0.348 n=10)
Match/Medium/32-104                 773.2n ±  1%    750.3n ±  0%   -2.97% (p=0.000 n=10)
Match/Medium/1K-104                 24.19µ ±  1%    23.44µ ±  0%   -3.10% (p=0.000 n=10)
Match/Medium/32K-104                994.3µ ±  0%    953.8µ ±  1%   -4.07% (p=0.000 n=10)
Match/Medium/1M-104                 31.91m ±  0%    30.26m ±  1%   -5.16% (p=0.000 n=10)
Match/Medium/32M-104               1020.3m ±  0%    970.6m ±  1%   -4.87% (p=0.000 n=10)
Match/Hard/16-104                   4.385n ±  0%    4.384n ±  0%   -0.03% (p=0.043 n=10)
Match/Hard/32-104                   1.083µ ±  0%    1.107µ ±  0%   +2.26% (p=0.000 n=10)
Match/Hard/1K-104                   33.24µ ±  1%    33.63µ ±  0%   +1.18% (p=0.001 n=10)
Match/Hard/32K-104                  1.408m ±  3%    1.411m ±  1%        ~ (p=1.000 n=10)
Match/Hard/1M-104                   46.52m ±  6%    45.17m ±  1%        ~ (p=0.481 n=10)
Match/Hard/32M-104                   1.399 ±  8%     1.433 ±  1%        ~ (p=0.481 n=10)
Match/Hard1/16-104                  3.196µ ±  1%    3.556µ ±  0%  +11.25% (p=0.000 n=10)
Match/Hard1/32-104                  6.325µ ±  0%    6.850µ ±  0%   +8.30% (p=0.000 n=10)
Match/Hard1/1K-104                  193.1µ ±  0%    213.3µ ±  1%  +10.43% (p=0.000 n=10)
Match/Hard1/32K-104                 6.547m ±  1%    7.129m ±  1%   +8.90% (p=0.000 n=10)
Match/Hard1/1M-104                  209.6m ±  0%    226.7m ±  1%   +8.15% (p=0.000 n=10)
Match/Hard1/32M-104                  6.742 ±  0%     7.309 ±  1%   +8.42% (p=0.000 n=10)
Match_onepass_regex/16-104          267.0n ±  0%    286.2n ±  1%   +7.19% (p=0.000 n=10)
Match_onepass_regex/32-104          472.2n ±  0%    511.8n ±  1%   +8.38% (p=0.000 n=10)
Match_onepass_regex/1K-104          13.21µ ±  0%    14.54µ ±  0%  +10.05% (p=0.000 n=10)
Match_onepass_regex/32K-104         419.4µ ±  0%    462.4µ ±  0%  +10.25% (p=0.000 n=10)
Match_onepass_regex/1M-104          13.40m ±  0%    14.79m ±  0%  +10.38% (p=0.000 n=10)
Match_onepass_regex/32M-104         429.1m ±  0%    473.6m ±  0%  +10.39% (p=0.000 n=10)
geomean                             8.242µ          8.142µ         -1.22%

                            │  old1023.log   │             new1023-2.log              │
                            │      B/op      │     B/op      vs base                  │
Find-104                        0.000 ± 0%       0.000 ± 0%        ~ (p=1.000 n=10) ¹
FindAllNoMatches-104            0.000 ± 0%       0.000 ± 0%        ~ (p=1.000 n=10) ¹
FindString-104                  0.000 ± 0%       0.000 ± 0%        ~ (p=1.000 n=10) ¹
FindSubmatch-104                48.00 ± 2%       49.00 ± 0%   +2.08% (p=0.011 n=10)
FindStringSubmatch-104          32.00 ± 0%       32.00 ± 0%        ~ (p=1.000 n=10) ¹
Compile/Onepass-104           3.961Ki ± 0%     3.961Ki ± 0%        ~ (p=1.000 n=10) ¹
Compile/Medium-104            9.203Ki ± 0%     9.203Ki ± 0%        ~ (p=1.000 n=10) ¹
Compile/Hard-104              82.78Ki ± 0%     82.77Ki ± 0%   -0.00% (p=0.019 n=10)
Match_onepass_regex/16-104      0.000 ± 0%       0.000 ± 0%        ~ (p=1.000 n=10) ¹
Match_onepass_regex/32-104      0.000 ± 0%       0.000 ± 0%        ~ (p=1.000 n=10) ¹
Match_onepass_regex/1K-104      0.000 ± 0%       0.000 ± 0%        ~ (p=1.000 n=10) ¹
Match_onepass_regex/32K-104     4.000 ± 0%       5.000 ± 0%  +25.00% (p=0.000 n=10)
Match_onepass_regex/1M-104      155.0 ± 0%       170.0 ± 1%   +9.68% (p=0.000 n=10)
Match_onepass_regex/32M-104   4.450Ki ± 1%     4.450Ki ± 0%        ~ (p=0.474 n=10)
geomean                                    ²                  +2.43%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                            │ old1023.log  │            new1023-2.log            │
                            │  allocs/op   │ allocs/op   vs base                 │
Find-104                      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
FindAllNoMatches-104          0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
FindString-104                0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
FindSubmatch-104              1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
FindStringSubmatch-104        1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
Compile/Onepass-104           52.00 ± 0%     52.00 ± 0%       ~ (p=1.000 n=10) ¹
Compile/Medium-104            112.0 ± 0%     112.0 ± 0%       ~ (p=1.000 n=10) ¹
Compile/Hard-104              424.0 ± 0%     424.0 ± 0%       ~ (p=1.000 n=10) ¹
Match_onepass_regex/16-104    0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
Match_onepass_regex/32-104    0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
Match_onepass_regex/1K-104    0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
Match_onepass_regex/32K-104   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
Match_onepass_regex/1M-104    0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
Match_onepass_regex/32M-104   1.000 ±  ?     1.000 ± 0%       ~ (p=0.474 n=10)
geomean                                  ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                            │  old1023.log  │            new1023-2.log             │
                            │      B/s      │     B/s       vs base                │
QuoteMetaAll-104               118.4Mi ± 1%   119.3Mi ± 1%   +0.81% (p=0.001 n=10)
QuoteMetaNone-104              482.5Mi ± 1%   542.8Mi ± 2%  +12.48% (p=0.000 n=10)
Match/Easy0/16-104             3.398Gi ± 0%   3.400Gi ± 0%   +0.05% (p=0.002 n=10)
Match/Easy0/32-104             620.1Mi ± 3%   663.2Mi ± 4%   +6.96% (p=0.000 n=10)
Match/Easy0/1K-104             3.694Gi ± 0%   3.712Gi ± 0%   +0.49% (p=0.000 n=10)
Match/Easy0/32K-104            6.974Gi ± 1%   6.811Gi ± 1%   -2.34% (p=0.000 n=10)
Match/Easy0/1M-104             3.769Gi ± 0%   3.882Gi ± 0%   +2.98% (p=0.000 n=10)
Match/Easy0/32M-104            3.650Gi ± 1%   3.722Gi ± 1%   +1.99% (p=0.000 n=10)
Match/Easy0i/16-104            3.399Gi ± 0%   3.399Gi ± 0%        ~ (p=0.448 n=10)
Match/Easy0i/32-104            38.18Mi ± 0%   39.84Mi ± 0%   +4.35% (p=0.000 n=10)
Match/Easy0i/1K-104            41.58Mi ± 0%   43.77Mi ± 0%   +5.26% (p=0.000 n=10)
Match/Easy0i/32K-104           32.90Mi ± 1%   34.66Mi ± 1%   +5.35% (p=0.000 n=10)
Match/Easy0i/1M-104            32.96Mi ± 1%   34.45Mi ± 1%   +4.50% (p=0.000 n=10)
Match/Easy0i/32M-104           33.05Mi ± 0%   34.30Mi ± 1%   +3.79% (p=0.000 n=10)
Match/Easy1/16-104             3.398Gi ± 0%   3.399Gi ± 0%   +0.05% (p=0.016 n=10)
Match/Easy1/32-104             714.2Mi ± 0%   730.4Mi ± 3%        ~ (p=0.089 n=10)
Match/Easy1/1K-104             1.541Gi ± 0%   1.552Gi ± 0%   +0.73% (p=0.000 n=10)
Match/Easy1/32K-104            1.066Gi ± 2%   1.053Gi ± 1%        ~ (p=0.052 n=10)
Match/Easy1/1M-104            1002.1Mi ± 2%   985.1Mi ± 0%   -1.69% (p=0.029 n=10)
Match/Easy1/32M-104           1005.8Mi ± 2%   984.5Mi ± 0%   -2.11% (p=0.000 n=10)
Match/Medium/16-104            3.398Gi ± 0%   3.398Gi ± 0%        ~ (p=0.315 n=10)
Match/Medium/32-104            39.47Mi ± 1%   40.67Mi ± 0%   +3.06% (p=0.000 n=10)
Match/Medium/1K-104            40.37Mi ± 1%   41.67Mi ± 0%   +3.21% (p=0.000 n=10)
Match/Medium/32K-104           31.43Mi ± 0%   32.76Mi ± 1%   +4.25% (p=0.000 n=10)
Match/Medium/1M-104            31.34Mi ± 0%   33.04Mi ± 1%   +5.45% (p=0.000 n=10)
Match/Medium/32M-104           31.37Mi ± 0%   32.97Mi ± 1%   +5.12% (p=0.000 n=10)
Match/Hard/16-104              3.398Gi ± 0%   3.399Gi ± 0%   +0.04% (p=0.023 n=10)
Match/Hard/32-104              28.18Mi ± 0%   27.56Mi ± 0%   -2.22% (p=0.000 n=10)
Match/Hard/1K-104              29.38Mi ± 1%   29.03Mi ± 0%   -1.18% (p=0.001 n=10)
Match/Hard/32K-104             22.22Mi ± 3%   22.14Mi ± 1%        ~ (p=1.000 n=10)
Match/Hard/1M-104              21.49Mi ± 7%   22.14Mi ± 1%        ~ (p=0.462 n=10)
Match/Hard/32M-104             22.87Mi ± 8%   22.33Mi ± 1%        ~ (p=0.462 n=10)
Match/Hard1/16-104             4.778Mi ± 1%   4.292Mi ± 0%  -10.18% (p=0.000 n=10)
Match/Hard1/32-104             4.826Mi ± 0%   4.454Mi ± 0%   -7.71% (p=0.000 n=10)
Match/Hard1/1K-104             5.054Mi ± 0%   4.578Mi ± 1%   -9.43% (p=0.000 n=10)
Match/Hard1/32K-104            4.778Mi ± 1%   4.387Mi ± 1%   -8.18% (p=0.000 n=10)
Match/Hard1/1M-104             4.768Mi ± 0%   4.411Mi ± 1%   -7.50% (p=0.000 n=10)
Match/Hard1/32M-104            4.749Mi ± 0%   4.377Mi ± 1%   -7.83% (p=0.000 n=10)
Match_onepass_regex/16-104     57.15Mi ± 0%   53.32Mi ± 1%   -6.71% (p=0.000 n=10)
Match_onepass_regex/32-104     64.63Mi ± 0%   59.63Mi ± 1%   -7.73% (p=0.000 n=10)
Match_onepass_regex/1K-104     73.91Mi ± 0%   67.16Mi ± 0%   -9.13% (p=0.000 n=10)
Match_onepass_regex/32K-104    74.51Mi ± 0%   67.58Mi ± 0%   -9.30% (p=0.000 n=10)
Match_onepass_regex/1M-104     74.62Mi ± 0%   67.60Mi ± 0%   -9.40% (p=0.000 n=10)
Match_onepass_regex/32M-104    74.58Mi ± 0%   67.56Mi ± 0%   -9.41% (p=0.000 n=10)
geomean                        126.5Mi        125.2Mi        -1.06%

pkg: regexp/syntax
                   │ old1023.log │            new1023-2.log            │
                   │   sec/op    │   sec/op     vs base                │
EmptyOpContext-104   153.8n ± 0%   151.5n ± 0%   -1.50% (p=0.000 n=10)
IsWordChar-104       150.8n ± 0%   109.5n ± 0%  -27.39% (p=0.000 n=10)
geomean              152.3n        128.8n       -15.43%

pkg: sort
                             │ old1023.log  │            new1023-2.log             │
                             │    sec/op    │    sec/op     vs base                │
SearchWrappers-104             85.33n ±  0%   83.50n ±  0%   -2.14% (p=0.000 n=10)
SortInts-104                   14.47m ±  0%   14.75m ±  0%   +1.93% (p=0.000 n=10)
SlicesSortInts-104             7.731m ±  0%   7.833m ±  0%   +1.32% (p=0.000 n=10)
SortIsSorted-104               304.0µ ±  2%   266.2µ ±  0%  -12.43% (p=0.000 n=10)
SlicesIsSorted-104             76.39µ ±  2%   68.17µ ±  1%  -10.77% (p=0.000 n=10)
SortStrings-104                31.38m ±  0%   32.18m ±  0%   +2.53% (p=0.000 n=10)
SlicesSortStrings-104          25.83m ±  0%   25.60m ±  0%   -0.90% (p=0.000 n=10)
SortStrings_Sorted-104         840.2µ ±  0%   842.1µ ±  0%   +0.23% (p=0.002 n=10)
SlicesSortStrings_Sorted-104   719.7µ ±  0%   626.7µ ±  6%  -12.92% (p=0.000 n=10)
SortStructs-104                20.00m ±  1%   19.74m ±  1%   -1.32% (p=0.004 n=10)
SortFuncStructs-104            15.02m ±  1%   15.11m ±  1%        ~ (p=0.123 n=10)
SortString1K-104               93.63µ ±  0%   93.65µ ±  0%        ~ (p=0.971 n=10)
SortString1K_Slice-104         115.4µ ±  0%   115.7µ ±  0%   +0.26% (p=0.019 n=10)
StableString1K-104             149.4µ ±  1%   155.0µ ±  1%   +3.78% (p=0.000 n=10)
SortInt1K-104                  18.80µ ±  0%   17.65µ ±  0%   -6.11% (p=0.000 n=10)
SortInt1K_Sorted-104           1.098µ ±  0%   1.084µ ±  0%   -1.23% (p=0.000 n=10)
SortInt1K_Reversed-104         1.603µ ± 25%   1.598µ ± 23%   -0.28% (p=0.017 n=10)
SortInt1K_Mod8-104             6.171µ ±  9%   6.130µ ±  0%   -0.66% (p=0.001 n=10)
StableInt1K-104                66.51µ ±  1%   69.61µ ±  0%   +4.67% (p=0.000 n=10)
StableInt1K_Slice-104          61.85µ ±  0%   63.95µ ±  2%   +3.39% (p=0.000 n=10)
SortInt64K-104                 2.436m ±  0%   2.375m ±  0%   -2.51% (p=0.000 n=10)
SortInt64K_Slice-104           5.401m ±  0%   5.791m ±  0%   +7.22% (p=0.000 n=10)
StableInt64K-104               5.798m ±  0%   6.079m ±  0%   +4.85% (p=0.000 n=10)
Sort1e2-104                    39.30µ ±  1%   40.02µ ±  6%   +1.85% (p=0.001 n=10)
Stable1e2-104                  73.20µ ±  8%   75.30µ ±  1%   +2.87% (p=0.019 n=10)
Sort1e4-104                    7.761m ±  0%   7.947m ±  0%   +2.39% (p=0.000 n=10)
Stable1e4-104                  22.17m ±  0%   22.47m ±  0%   +1.34% (p=0.000 n=10)
Sort1e6-104                     1.165 ±  0%    1.198 ±  0%   +2.86% (p=0.000 n=10)
Stable1e6-104                   4.545 ±  2%    4.504 ±  0%        ~ (p=0.165 n=10)
geomean                        669.5µ         666.4µ         -0.46%

pkg: strconv
                                          │ old1023.log │            new1023-2.log             │
                                          │   sec/op    │    sec/op     vs base                │
Atof64Decimal-104                           30.69n ± 1%    32.41n ± 0%   +5.60% (p=0.000 n=10)
Atof64Float-104                             35.53n ± 0%    40.09n ± 0%  +12.82% (p=0.000 n=10)
Atof64FloatExp-104                          42.62n ± 0%    44.41n ± 0%   +4.20% (p=0.000 n=10)
Atof64Big-104                               111.3n ± 0%    116.8n ± 0%   +4.99% (p=0.000 n=10)
Atof64RandomBits-104                        128.9n ± 0%    118.5n ± 0%   -8.07% (p=0.000 n=10)
Atof64RandomFloats-104                      97.83n ± 0%   105.25n ± 0%   +7.58% (p=0.000 n=10)
Atof64RandomLongFloats-104                  142.2n ± 0%    151.4n ± 0%   +6.47% (p=0.000 n=10)
Atof32Decimal-104                           31.39n ± 0%    32.14n ± 0%   +2.41% (p=0.000 n=10)
Atof32Float-104                             32.98n ± 0%    36.78n ± 0%  +11.52% (p=0.000 n=10)
Atof32FloatExp-104                          45.73n ± 0%    50.01n ± 0%   +9.35% (p=0.000 n=10)
Atof32Random-104                            76.67n ± 0%    86.97n ± 0%  +13.43% (p=0.000 n=10)
Atof32RandomLong-104                        153.1n ± 0%    162.4n ± 0%   +6.04% (p=0.000 n=10)
ParseInt/Pos/7bit-104                       13.78n ± 0%    14.72n ± 0%   +6.79% (p=0.000 n=10)
ParseInt/Pos/26bit-104                      20.05n ± 0%    23.22n ± 0%  +15.81% (p=0.000 n=10)
ParseInt/Pos/31bit-104                      22.55n ± 0%    26.39n ± 0%  +17.03% (p=0.000 n=10)
ParseInt/Pos/56bit-104                      39.99n ± 1%    42.38n ± 0%   +5.98% (p=0.000 n=10)
ParseInt/Pos/63bit-104                      40.90n ± 3%    45.05n ± 0%  +10.15% (p=0.000 n=10)
ParseInt/Neg/7bit-104                       14.33n ± 0%    15.09n ± 0%   +5.30% (p=0.000 n=10)
ParseInt/Neg/26bit-104                      20.62n ± 0%    22.87n ± 0%  +10.91% (p=0.000 n=10)
ParseInt/Neg/31bit-104                      23.09n ± 0%    26.19n ± 0%  +13.43% (p=0.000 n=10)
ParseInt/Neg/56bit-104                      39.90n ± 1%    42.68n ± 1%   +6.95% (p=0.000 n=10)
ParseInt/Neg/63bit-104                      43.51n ± 1%    45.22n ± 1%   +3.93% (p=0.000 n=10)
Atoi/Pos/7bit-104                           5.325n ± 0%    6.387n ± 0%  +19.94% (p=0.000 n=10)
Atoi/Pos/26bit-104                          8.457n ± 0%   11.205n ± 1%  +32.49% (p=0.000 n=10)
Atoi/Pos/31bit-104                          9.713n ± 0%   13.140n ± 2%  +35.28% (p=0.000 n=10)
Atoi/Pos/56bit-104                          14.18n ± 0%    20.30n ± 1%  +43.19% (p=0.000 n=10)
Atoi/Pos/63bit-104                          43.80n ± 0%    46.32n ± 1%   +5.75% (p=0.000 n=10)
Atoi/Neg/7bit-104                           5.638n ± 0%    6.268n ± 0%  +11.17% (p=0.000 n=10)
Atoi/Neg/26bit-104                          8.773n ± 0%   11.720n ± 0%  +33.58% (p=0.000 n=10)
Atoi/Neg/31bit-104                          10.38n ± 1%    13.95n ± 0%  +34.46% (p=0.000 n=10)
Atoi/Neg/56bit-104                          15.06n ± 1%    19.50n ± 0%  +29.48% (p=0.000 n=10)
Atoi/Neg/63bit-104                          45.33n ± 1%    46.43n ± 0%   +2.43% (p=0.000 n=10)
FormatFloat/Decimal-104                     91.03n ± 2%    91.73n ± 0%   +0.77% (p=0.000 n=10)
FormatFloat/Float-104                       121.4n ± 2%    118.3n ± 0%   -2.59% (p=0.000 n=10)
FormatFloat/Exp-104                         127.0n ± 0%    124.1n ± 0%   -2.28% (p=0.000 n=10)
FormatFloat/NegExp-104                      126.9n ± 0%    123.3n ± 0%   -2.84% (p=0.000 n=10)
FormatFloat/LongExp-104                     133.3n ± 0%    128.3n ± 0%   -3.75% (p=0.000 n=10)
FormatFloat/Big-104                         140.5n ± 0%    138.4n ± 0%   -1.49% (p=0.000 n=10)
FormatFloat/BinaryExp-104                   81.58n ± 0%    77.84n ± 0%   -4.58% (p=0.000 n=10)
FormatFloat/32Integer-104                   89.84n ± 2%    91.87n ± 0%   +2.25% (p=0.000 n=10)
FormatFloat/32ExactFraction-104             121.2n ± 0%    120.7n ± 0%   -0.45% (p=0.000 n=10)
FormatFloat/32Point-104                     117.9n ± 1%    115.4n ± 0%   -2.12% (p=0.000 n=10)
FormatFloat/32Exp-104                       132.1n ± 0%    128.2n ± 0%   -2.95% (p=0.000 n=10)
FormatFloat/32NegExp-104                    122.9n ± 1%    119.7n ± 0%   -2.60% (p=0.000 n=10)
FormatFloat/32Shortest-104                  114.4n ± 0%    111.5n ± 0%   -2.53% (p=0.000 n=10)
FormatFloat/32Fixed8Hard-104                92.40n ± 0%    91.13n ± 0%   -1.37% (p=0.000 n=10)
FormatFloat/32Fixed9Hard-104                96.87n ± 3%    95.84n ± 0%   -1.06% (p=0.000 n=10)
FormatFloat/64Fixed1-104                    89.89n ± 0%    85.97n ± 0%   -4.36% (p=0.000 n=10)
FormatFloat/64Fixed2-104                    87.22n ± 1%    83.84n ± 0%   -3.88% (p=0.000 n=10)
FormatFloat/64Fixed3-104                    88.69n ± 1%    85.66n ± 0%   -3.42% (p=0.000 n=10)
FormatFloat/64Fixed4-104                    82.98n ± 0%    81.34n ± 0%   -1.98% (p=0.000 n=10)
FormatFloat/64Fixed12-104                   111.6n ± 1%    107.7n ± 0%   -3.45% (p=0.000 n=10)
FormatFloat/64Fixed16-104                   106.0n ± 1%    104.5n ± 0%   -1.37% (p=0.001 n=10)
FormatFloat/64Fixed12Hard-104               102.2n ± 0%    100.3n ± 0%   -1.86% (p=0.000 n=10)
FormatFloat/64Fixed17Hard-104               113.3n ± 1%    114.5n ± 0%   +1.01% (p=0.000 n=10)
FormatFloat/64Fixed18Hard-104               4.207µ ± 0%    4.099µ ± 0%   -2.56% (p=0.000 n=10)
FormatFloat/Slowpath64-104                  130.3n ± 1%    128.1n ± 0%   -1.65% (p=0.000 n=10)
FormatFloat/SlowpathDenormal64-104          125.4n ± 0%    124.4n ± 0%   -0.80% (p=0.000 n=10)
AppendFloat/Decimal-104                     52.39n ± 0%    56.99n ± 0%   +8.77% (p=0.000 n=10)
AppendFloat/Float-104                       79.11n ± 0%    81.02n ± 0%   +2.42% (p=0.000 n=10)
AppendFloat/Exp-104                         83.94n ± 0%    84.84n ± 0%   +1.07% (p=0.000 n=10)
AppendFloat/NegExp-104                      83.00n ± 0%    83.78n ± 0%   +0.95% (p=0.000 n=10)
AppendFloat/LongExp-104                     89.11n ± 0%    88.42n ± 0%   -0.78% (p=0.000 n=10)
AppendFloat/Big-104                         94.54n ± 0%    95.36n ± 0%   +0.87% (p=0.000 n=10)
AppendFloat/BinaryExp-104                   49.27n ± 0%    46.99n ± 0%   -4.63% (p=0.000 n=10)
AppendFloat/32Integer-104                   52.86n ± 0%    56.91n ± 0%   +7.67% (p=0.000 n=10)
AppendFloat/32ExactFraction-104             75.23n ± 0%    76.98n ± 0%   +2.33% (p=0.000 n=10)
AppendFloat/32Point-104                     73.45n ± 0%    75.04n ± 0%   +2.16% (p=0.000 n=10)
AppendFloat/32Exp-104                       86.51n ± 0%    83.03n ± 0%   -4.02% (p=0.000 n=10)
AppendFloat/32NegExp-104                    77.58n ± 0%    78.43n ± 0%   +1.10% (p=0.000 n=10)
AppendFloat/32Shortest-104                  70.48n ± 0%    70.66n ± 0%   +0.26% (p=0.000 n=10)
AppendFloat/32Fixed8Hard-104                53.96n ± 0%    54.30n ± 0%   +0.62% (p=0.000 n=10)
AppendFloat/32Fixed9Hard-104                58.09n ± 0%    59.18n ± 0%   +1.88% (p=0.000 n=10)
AppendFloat/64Fixed1-104                    49.38n ± 0%    50.06n ± 0%   +1.38% (p=0.000 n=10)
AppendFloat/64Fixed2-104                    48.09n ± 0%    48.52n ± 0%   +0.89% (p=0.000 n=10)
AppendFloat/64Fixed3-104                    49.64n ± 0%    53.18n ± 1%   +7.13% (p=0.000 n=10)
AppendFloat/64Fixed4-104                    47.12n ± 0%    47.29n ± 0%   +0.37% (p=0.000 n=10)
AppendFloat/64Fixed12-104                   75.18n ± 0%    71.22n ± 1%   -5.27% (p=0.000 n=10)
AppendFloat/64Fixed16-104                   66.34n ± 0%    67.69n ± 0%   +2.03% (p=0.000 n=10)
AppendFloat/64Fixed12Hard-104               63.66n ± 1%    63.09n ± 1%   -0.90% (p=0.000 n=10)
AppendFloat/64Fixed17Hard-104               74.13n ± 0%    75.02n ± 0%   +1.19% (p=0.000 n=10)
AppendFloat/64Fixed18Hard-104               4.100µ ± 0%    4.019µ ± 0%   -1.98% (p=0.000 n=10)
AppendFloat/Slowpath64-104                  87.22n ± 0%    87.79n ± 0%   +0.67% (p=0.000 n=10)
AppendFloat/SlowpathDenormal64-104          85.68n ± 1%    85.69n ± 0%        ~ (p=0.839 n=10)
FormatInt-104                               1.736µ ± 2%    1.653µ ± 0%   -4.78% (p=0.000 n=10)
AppendInt-104                               1.074µ ± 0%    1.022µ ± 0%   -4.80% (p=0.000 n=10)
FormatUint-104                              452.3n ± 0%    425.2n ± 0%   -6.00% (p=0.000 n=10)
AppendUint-104                              290.1n ± 0%    267.8n ± 0%   -7.70% (p=0.000 n=10)
FormatIntSmall/7-104                        2.819n ± 0%    3.132n ± 0%  +11.12% (p=0.000 n=10)
FormatIntSmall/42-104                       3.132n ± 0%    3.445n ± 0%   +9.99% (p=0.000 n=10)
AppendIntSmall-104                          5.637n ± 0%    5.951n ± 0%   +5.56% (p=0.000 n=10)
AppendUintVarlen/1-104                      5.325n ± 0%    5.331n ± 0%   +0.10% (p=0.001 n=10)
AppendUintVarlen/12-104                     5.639n ± 0%    5.951n ± 0%   +5.53% (p=0.000 n=10)
AppendUintVarlen/123-104                    13.78n ± 0%    13.47n ± 0%   -2.25% (p=0.000 n=10)
AppendUintVarlen/1234-104                   14.41n ± 0%    13.78n ± 0%   -4.37% (p=0.000 n=10)
AppendUintVarlen/12345-104                  16.29n ± 0%    15.66n ± 0%   -3.87% (p=0.000 n=10)
AppendUintVarlen/123456-104                 17.24n ± 0%    16.60n ± 0%   -3.71% (p=0.000 n=10)
AppendUintVarlen/1234567-104                19.44n ± 0%    18.48n ± 0%   -4.94% (p=0.000 n=10)
AppendUintVarlen/12345678-104               20.08n ± 0%    19.11n ± 0%   -4.83% (p=0.000 n=10)
AppendUintVarlen/123456789-104              22.26n ± 0%    20.99n ± 0%   -5.71% (p=0.000 n=10)
AppendUintVarlen/1234567890-104             23.21n ± 0%    21.61n ± 0%   -6.89% (p=0.000 n=10)
AppendUintVarlen/12345678901-104            25.06n ± 0%    23.49n ± 0%   -6.26% (p=0.000 n=10)
AppendUintVarlen/123456789012-104           26.00n ± 0%    24.43n ± 0%   -6.04% (p=0.000 n=10)
AppendUintVarlen/1234567890123-104          27.88n ± 0%    26.00n ± 0%   -6.74% (p=0.000 n=10)
AppendUintVarlen/12345678901234-104         28.80n ± 0%    26.94n ± 0%   -6.46% (p=0.000 n=10)
AppendUintVarlen/123456789012345-104        31.00n ± 0%    28.82n ± 0%   -7.03% (p=0.000 n=10)
AppendUintVarlen/1234567890123456-104       31.64n ± 0%    29.44n ± 0%   -6.95% (p=0.000 n=10)
AppendUintVarlen/12345678901234567-104      34.12n ± 0%    31.63n ± 0%   -7.31% (p=0.000 n=10)
AppendUintVarlen/123456789012345678-104     35.08n ± 0%    32.26n ± 0%   -8.04% (p=0.000 n=10)
AppendUintVarlen/1234567890123456789-104    36.96n ± 0%    34.14n ± 0%   -7.63% (p=0.000 n=10)
AppendUintVarlen/12345678901234567890-104   37.90n ± 0%    35.08n ± 0%   -7.44% (p=0.000 n=10)
Quote-104                                   296.1n ± 5%    301.8n ± 0%        ~ (p=0.138 n=10)
QuoteRune-104                               34.20n ± 0%    34.16n ± 0%   -0.13% (p=0.000 n=10)
AppendQuote-104                             191.2n ± 0%    190.1n ± 0%   -0.63% (p=0.000 n=10)
AppendQuoteRune-104                         8.832n ± 0%    8.457n ± 0%   -4.25% (p=0.000 n=10)
UnquoteEasy-104                             33.70n ± 1%    34.53n ± 1%   +2.43% (p=0.000 n=10)
UnquoteHard-104                             512.4n ± 0%    476.9n ± 0%   -6.92% (p=0.000 n=10)
geomean                                     52.86n         53.96n        +2.07%

The geomean does not change significantly (+2.07%). However, individual benchmarks fluctuate considerably. For example, FormatIntSmall/7 regresses by 11.12%, while AppendUintVarlen/123456789012345678 improves by 8.04%. There is still a lot of performance tuning work to do.

@laboger
Contributor

laboger commented Oct 24, 2023 via email

@laboger
Contributor

laboger commented Oct 24, 2023

I tried your source tree. It looks like there is a bug when building the go1 benchmark test Fannkuch. This test used to be under test/bench/go1 but it was removed in Go 1.21. If you go back to Go 1.20 you can build it. One of the loops in the Fannkuch test doesn't exit.

I also noticed that hoisting the NilCheck is not always a good thing. I recently found a similar problem with tighten when a load gets moved to a different block. A later pass eliminates unnecessary NilChecks, but it only removes a NilCheck if it is in the same block as a load or store that uses the same pointer/address. Look at the loop in math.BenchmarkSqrtIndirectLatency.

TEXT math_test.BenchmarkSqrtIndirectLatency(SB) /home/boger/golang/y1yang0/go/src/math/all_test.go
        for i := 0; i < b.N; i++ {
  0x13c660              e88301a0                MOVD 416(R3),R4                      // ld r4,416(r3)   
  0x13c664              7c240000                CMP R4,R0                            // cmpd r4,r0      
  0x13c668              40810028                BLE 0x13c690                         // ble 0x13c690    
  0x13c66c              8be30000                MOVBZ 0(R3),R31                      // lbz r31,0(r3)   <-- This should stay before the MOVD that is based on r3.
  0x13c670              38600000                MOVD $0,R3                           // li r3,0         
  0x13c674              06100006 c800f50c       PLFD 455948(0),$1,F0                 // plfd f0,455948  
  0x13c67c              38630001                ADD R3,$1,R3                         // addi r3,r3,1    
        return sqrt(x)
  0x13c680              fc00002c                FSQRT F0,F0                          // fsqrt f0,f0     
        for i := 0; i < b.N; i++ {
  0x13c684              7c241800                CMP R4,R3                            // cmpd r4,r3      
                x = f(x)
  0x13c688              4181fff4                BGT 0x13c67c                         // bgt 0x13c67c    
  0x13c68c              4800000c                BR 0x13c698                          // b 0x13c698      
  0x13c690              06100006 c800f4f0       PLFD 455920(0),$1,F0                 // plfd f0,455920  
        GlobalF = x
  0x13c698              06100018 d800bb70       PSTFD 1620848(0),$1,F0               // pstfd f0,1620848        
}
  0x13c6a0              4e800020                RET                                  // blr     
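To make the same-block condition concrete, here is a simplified, hypothetical model of that later elimination pass. It is loosely inspired by the late nil-check elimination in cmd/compile, but the Op/Value types are made up and minZeroPage = 4096 is an assumption. It walks one block backwards and drops a nil check only when a faulting load or store through the same pointer follows it in the same block with no intervening store. A check that has been separated from its load by hoisting is never removed.

```go
package main

import "fmt"

// Op is a toy opcode for this sketch; it is not cmd/compile's ssa.Op.
type Op int

const (
	OpNilCheck Op = iota // explicit nil check of Ptr
	OpLoad               // load from Ptr+Offset
	OpStore              // store to Ptr+Offset (has a side effect)
	OpOther              // anything else (arithmetic, branches, ...)
)

// Value is a toy SSA value: an op plus the pointer it dereferences.
type Value struct {
	ID     int
	Op     Op
	Ptr    int   // ID of the pointer operand (NilCheck/Load/Store only)
	Offset int64 // constant offset from Ptr (Load/Store only)
}

// minZeroPage is the assumed size of the unmapped region at address 0:
// a load or store at a small non-negative offset from nil faults.
const minZeroPage = 4096

// redundantNilChecks walks one block backwards and reports the nil checks
// made unnecessary by a later faulting load/store of the same pointer in
// the SAME block. Checks whose access lives in another block are kept.
func redundantNilChecks(block []Value) map[int]bool {
	redundant := map[int]bool{}
	faults := map[int]bool{} // pointers dereferenced later in this block
	for i := len(block) - 1; i >= 0; i-- {
		v := block[i]
		switch v.Op {
		case OpNilCheck:
			if faults[v.Ptr] {
				redundant[v.ID] = true
			}
		case OpStore:
			// A nil check must not be moved past a side effect, so accesses
			// after this store no longer cover nil checks before it.
			faults = map[int]bool{}
			fallthrough
		case OpLoad:
			if v.Offset >= 0 && v.Offset < minZeroPage {
				faults[v.Ptr] = true
			}
		}
	}
	return redundant
}

func main() {
	// Nil check of pointer 3 followed by a load of 416(pointer 3) in the
	// same block: the check is redundant.
	same := []Value{{ID: 1, Op: OpNilCheck, Ptr: 3}, {ID: 2, Op: OpLoad, Ptr: 3, Offset: 416}}
	fmt.Println(redundantNilChecks(same)) // map[1:true]

	// The same nil check alone in its block (its load lives elsewhere): kept.
	fmt.Println(redundantNilChecks(same[:1])) // map[]
}
```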

@randall77
Contributor

and I couldn't make it work because of what tighten does with loads is at odds with what hoisting is trying to do.

I don't think this should be a problem. It's true that hoisting and tightening work in opposite directions, but tightening should never move anything into a loop, so as long as you're correctly hoisting out of a loop, tightening should not undo that work. (Or tightening has a bug.)
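A toy sketch of the rule being described, in the spirit of the loop-depth check in tighten (cmd/compile/internal/ssa/tighten.go). The Block/Loop types are hypothetical, reducible loops are assumed, and no loop header is assumed to be the entry block. The candidate block is the nearest common dominator of the uses; if that lies in a deeper loop than the definition, it is pulled back out to the loop header's immediate dominator, so a value hoisted out of a loop stays out.

```go
package main

import "fmt"

// Loop and Block are a toy CFG model for this sketch, not cmd/compile types.
type Loop struct {
	Header *Block
	Depth  int // 1 for an outermost loop
}

type Block struct {
	Name string
	Idom *Block // immediate dominator; nil for the entry block
	Loop *Loop  // innermost loop containing the block; nil if none
}

func loopDepth(b *Block) int {
	if b.Loop == nil {
		return 0
	}
	return b.Loop.Depth
}

// domDepth is the number of blocks on the dominator-tree path to the entry.
func domDepth(b *Block) int {
	n := 0
	for ; b != nil; b = b.Idom {
		n++
	}
	return n
}

// lca returns the nearest common dominator of a and b.
func lca(a, b *Block) *Block {
	for domDepth(a) > domDepth(b) {
		a = a.Idom
	}
	for domDepth(b) > domDepth(a) {
		b = b.Idom
	}
	for a != b {
		a, b = a.Idom, b.Idom
	}
	return a
}

// tightenTarget picks where to move a value defined in def: as close to its
// uses as possible, but never into a loop deeper than def's own loop.
// Assumes no loop header is the entry block (so Header.Idom is non-nil).
func tightenTarget(def *Block, uses []*Block) *Block {
	t := uses[0]
	for _, u := range uses[1:] {
		t = lca(t, u)
	}
	for t.Loop != nil && t.Loop.Depth > loopDepth(def) {
		t = t.Loop.Header.Idom // just before the loop
	}
	return t
}

func main() {
	// entry -> pre -> header <-> body, header -> exit
	entry := &Block{Name: "entry"}
	pre := &Block{Name: "pre", Idom: entry}
	header := &Block{Name: "header", Idom: pre}
	body := &Block{Name: "body", Idom: header}
	exit := &Block{Name: "exit", Idom: header}
	loop := &Loop{Header: header, Depth: 1}
	header.Loop, body.Loop = loop, loop

	fmt.Println(tightenTarget(entry, []*Block{body}).Name) // pre: not moved into the loop
	fmt.Println(tightenTarget(entry, []*Block{exit}).Name) // exit
}
```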

@randall77
Contributor

(Or maybe you mean the rematerialization of constants that regalloc does? We could disable rematerialization for constants that would require a load from a constant pool.)

@laboger
Contributor

laboger commented Oct 25, 2023

(Or maybe you mean the rematerialization of constants that regalloc does? We could disable rematerialization for constants that would require a load from a constant pool.)

I mean loads of constant values which shouldn't have any aliases. I think the rematerialization flag mostly controls where this gets loaded, and tighten has some effect on that. It's been a while since I tried to play with it.

Base addresses of arrays, slices, etc. should be constant/invariant and hoisted instead of reloaded at each iteration. I can't seem to get that to happen.

I should add that I don't know if there is anything in place right now to allow this type of hoisting even without the rematerialization done in regalloc. This statement is based on experimenting with various CLs that have tried to enable hoisting including the new one mentioned in this issue.

@laboger
Contributor

laboger commented Oct 25, 2023

Here's an example of what I meant above, from test/bench/go1.revcomp (back before it was removed)

v257  00294 (64) MOVB R11, (R6)(R9)
v259  00295 (+62) ADD $1, R4, R4
v343  00296 (+62) CMP R4, R10
b44   00297 (62) BGE 149
v241  00298 (62) MOVBZ (R3)(R4), R11
v243  00299 (+63) ADD $-1, R9, R9
v151  00300 (+64) MOVD $test/bench/go1.revCompTable(SB), R12 <--- this is a constant load that should be hoisted
v248  00301 (64) MOVBZ (R12)(R11), R11
v15   00302 (64) CMPU R5, R9
b45   00303 (64) BGT 294
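For reference, a minimal Go loop with the same shape. This is a hypothetical stand-in, not the original revcomp source, and compTable only approximates revCompTable. Each iteration indexes a package-level table, so the table's address is loop-invariant. Inspecting the `go build -gcflags=-S` output for reverseComplement is one way to check whether the address materialization stays inside the loop on a given architecture.

```go
package main

import "fmt"

// compTable is a stand-in for revCompTable from the old
// test/bench/go1 revcomp benchmark (not its actual contents).
var compTable [256]byte

func init() {
	for i := range compTable {
		compTable[i] = byte(i)
	}
	compTable['A'], compTable['T'] = 'T', 'A'
	compTable['C'], compTable['G'] = 'G', 'C'
}

// reverseComplement writes the reverse complement of src into dst, which
// must be at least len(src) long. &compTable is loop-invariant, so ideally
// its address is materialized once before the loop, not on every iteration.
func reverseComplement(dst, src []byte) {
	j := len(src)
	for _, c := range src {
		j--
		dst[j] = compTable[c]
	}
}

func main() {
	src := []byte("AACG")
	dst := make([]byte, len(src))
	reverseComplement(dst, src)
	fmt.Println(string(dst)) // CGTT
}
```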

@randall77
Copy link
Contributor

I think this is just a problem with our register allocator. At this point

if s.values[v.ID].rematerializeable {
we decide not to issue a rematerializeable op, as we can always issue it just before its use. But if the (a?) use is in a deeper loop nest than the definition, we should issue it at its definition point.

@y1yang0
Contributor Author

y1yang0 commented Oct 26, 2023

I also recently tried to do a change to add PCALIGN to the top of loop body but because of the way looprotate moved the blocks, the alignment couldn't be inserted in the correct place. With this improved way of doing looprotate it should be more straightforward to insert the PCALIGN in the appropriate spot. I have done some performance comparisons between gccgo and Golang on ppc64 since gccgo does better at optimizing loops in most cases on ppc64. That can show what improvement is possible.

Do you mean something like loop alignment? That's interesting.

I tried your source tree. It looks like there is a bug when building the go1 benchmark test Fannkuch. This test used to be under test/bench/go1 but it was removed in Go 1.21. If you go back to Go 1.20 you can build it. One of the loops in the Fannkuch test doesn't exit.

Thanks a lot! I'll fix this then. The current development is almost complete, and the remaining work mainly involves ensuring robustness and optimizing performance. I will try testing go1 benchmark and golang/benchmarks and then resolve any issues that may arise. (I am thinking about adding support for simple stress testing to the Go compiler, i.e. allowing a given pass to be run repeatedly within a range [start pass, end pass] by executing it after each pass in that range. This approach would greatly help ensure the stability of optimizations/transformations.)

Update: I confirmed this bug. TBAA thinks these are NoAlias, but they actually do alias, because an unsafe.Pointer may alias anything:

#v23 = SelectN <unsafe.Pointer> [0] v21 (perm1.ptr[*int])
#v328 = AddPtr <*int> v23 v488
#NoAlias
v306 (+58) = Load <int> v23 v387 (perm0[int])
v299 (60) = Store <mem> {int} v328 v320 v400
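A toy illustration of the fix, under stated assumptions. The Addr type and both helpers are hypothetical, and the TBAA in the patch may be structured differently. The key point is that unsafe.Pointer provenance must survive pointer arithmetic such as AddPtr. That way v328 inherits v23's "may point anywhere" status, and the query answers may-alias instead of NoAlias.

```go
package main

import "fmt"

// Addr is a toy model of an address value for type-based alias analysis.
type Addr struct {
	Name   string
	Elem   string // pointee type, e.g. "int"; "" if unknown
	Unsafe bool   // is, or was derived from, an unsafe.Pointer
}

// addPtr models AddPtr: the result gets a new static pointee type but
// inherits the unsafe provenance of its base.
func addPtr(name string, base Addr, elem string) Addr {
	return Addr{Name: name, Elem: elem, Unsafe: base.Unsafe}
}

// mayAlias is a toy TBAA query: differently typed addresses are treated as
// distinct, except that anything derived from unsafe.Pointer is assumed to
// alias everything.
func mayAlias(a, b Addr) bool {
	if a.Unsafe || b.Unsafe {
		return true
	}
	return a.Elem == b.Elem
}

func main() {
	v23 := Addr{Name: "v23", Unsafe: true} // SelectN <unsafe.Pointer>
	v328 := addPtr("v328", v23, "int")     // AddPtr <*int> v23 v488
	fmt.Println(mayAlias(v23, v328))       // true: must not report NoAlias

	p := Addr{Name: "p", Elem: "int"}
	q := Addr{Name: "q", Elem: "float64"}
	fmt.Println(mayAlias(p, q)) // false under the toy TBAA rule
}
```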

@laboger
Contributor

laboger commented Oct 26, 2023

Do you means something like loop alignment. That's interesting.

Yes, there are several ports that support the PCALIGN directive and currently it is only used in assembly implementations. It would be worthwhile to have the compiler generate it for appropriate loops.
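As a small illustration of what aligning a loop head involves, PCALIGN pads with NOPs so that the next instruction starts on an aligned boundary. The helper below is a hypothetical sketch of that arithmetic, with the alignment assumed to be a power of two. Using the loop head 0x13c67c from the ppc64 listing earlier in the thread (the BGT target), 32-byte alignment would need 4 bytes, i.e. one 4-byte ppc64 NOP.

```go
package main

import "fmt"

// padBytes returns the number of padding bytes needed so that code starting
// at address pc begins on an align-byte boundary (align > 0).
func padBytes(pc, align uint64) uint64 {
	return (align - pc%align) % align
}

func main() {
	// Loop head from the ppc64 listing (branch target of BGT 0x13c67c).
	fmt.Println(padBytes(0x13c67c, 32)) // 4
	fmt.Println(padBytes(0x13c680, 32)) // 0: already aligned
}
```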

we decide not to issue a rematerializeable op, as we can always issue it just before its use. But if the (a?) use is in a deeper loop nest than the definition, we should issue it at its definition point.

What do you mean by definition point for a constant? Do you mean that the loop hoisting code would generate the definition?

Thanks a lot! I'll fix this then. The current development is almost complete, and the remaining work mainly involves ensuring robustness and optimizing performance. I will try testing go1 benchmark and golang/benchmarks and then resolve any issues that may arise.

I still don't think the way NilChecks are hoisted is providing the best performance, as I mentioned earlier.

@randall77
Contributor

Currently, if you have this before regalloc:

v1 = MOVDaddr {myGlobalVariable}
loop:
    v2 = MOVBload v1 mem

Then after regalloc it moves v1 inside the loop

loop:
    v1 = MOVDaddr {myGlobalVariable}
    v2 = MOVBload v1 mem

We shouldn't do that move if the destination is in a loop. We should force v1 to stay at its original definition point.
Or maybe we reissue it just before the loop starts? In any case, don't move it into the loop.
This is currently a problem whether v1 was originally in the loop and a hoisting phase moved it out, or it was originally outside the loop in the user's code.
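A hypothetical sketch of the policy check described here. The names are made up, and this is not the actual regalloc code. A rematerializable value such as v1 = MOVDaddr {myGlobalVariable} is normally re-issued right before each use. If any use is in a deeper loop nest than the definition, it is issued once at its definition instead.

```go
package main

import "fmt"

// Block is a toy block with its loop nesting depth (0 = not in a loop).
type Block struct {
	Name      string
	LoopDepth int
}

// keepAtDefinition (hypothetical) reports whether a rematerializable value
// defined in def should be issued at its definition and kept in a register,
// because re-issuing it before some use would put it inside a deeper loop.
func keepAtDefinition(def *Block, uses []*Block) bool {
	for _, u := range uses {
		if u.LoopDepth > def.LoopDepth {
			return true
		}
	}
	return false
}

func main() {
	entry := &Block{Name: "entry", LoopDepth: 0}
	loop := &Block{Name: "loop", LoopDepth: 1}
	fmt.Println(keepAtDefinition(entry, []*Block{loop}))  // true: don't move v1 into the loop
	fmt.Println(keepAtDefinition(entry, []*Block{entry})) // false: rematerialize as usual
}
```

The cost of this choice is that v1 stays live in a register for the whole loop, which may matter in loops with high register pressure.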

@mknyszek mknyszek added this to the Backlog milestone Nov 1, 2023
Projects
Status: In Progress
Development

No branches or pull requests

7 participants