Convert folds to take two arguments #345

Merged
merged 1 commit into from
May 19, 2021

Boarders
Contributor

This pull request rewrites fold definitions of the form fold f z bs = ... as fold f z = \bs -> ... (so that each fold takes two arguments and returns a lambda), fixing issue #329.
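
As a rough illustration of the transformation (a simplified sketch, not the library's actual code): GHC unfolds an INLINE function only at call sites that supply as many arguments as appear syntactically to the left of `=`. The names `foldl3`, `foldl2`, `sum3` and `sum2` below are hypothetical, and the loop uses `B.index` rather than whatever the library does internally.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString as B
import Data.Word (Word8)

-- Three-argument form (hypothetical name): the INLINE pragma only fires at
-- call sites that supply f, z and bs, because all three appear left of '='.
foldl3 :: (a -> Word8 -> a) -> a -> B.ByteString -> a
foldl3 f z bs = go z 0
  where
    n = B.length bs
    go !acc i
      | i >= n    = acc
      | otherwise = go (f acc (B.index bs i)) (i + 1)
{-# INLINE foldl3 #-}

-- Two-argument form (hypothetical name): only f and z appear left of '=',
-- so a call that supplies just those two is already saturated for inlining.
foldl2 :: (a -> Word8 -> a) -> a -> B.ByteString -> a
foldl2 f z = \bs ->
  let n = B.length bs
      go !acc i
        | i >= n    = acc
        | otherwise = go (f acc (B.index bs i)) (i + 1)
  in go z 0
{-# INLINE foldl2 #-}

-- Eta-reduced callers: each passes only two arguments to the fold.
sum3, sum2 :: B.ByteString -> Int
sum3 = foldl3 (\acc w -> acc + fromIntegral w) 0
sum2 = foldl2 (\acc w -> acc + fromIntegral w) 0

main :: IO ()
main = do
  let bs = B.pack [1 .. 100]
  print (sum3 bs, sum2 bs)  -- prints (5050,5050)
```

Both versions compute the same result; the difference is only in whether a call such as `sum2` counts as saturated for the purposes of the INLINE pragma.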

On my machine this appears to improve the folds benchmarks significantly, but I am unsure whether that is an artifact of running in a noisy environment or has some other cause. For example:

Before:

benchmarked Data.ByteString.Builder/folds/foldl'/32768
time                 120.4 μs   (117.8 μs .. 124.2 μs)
                     0.993 R²   (0.986 R² .. 0.997 R²)
mean                 121.9 μs   (120.3 μs .. 124.0 μs)
std dev              6.364 μs   (5.074 μs .. 7.591 μs)
variance introduced by outliers: 31% (moderately inflated)

benchmarked Data.ByteString.Builder/folds/foldl'/65536
time                 233.3 μs   (230.9 μs .. 237.1 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 235.4 μs   (234.0 μs .. 238.2 μs)
std dev              6.545 μs   (4.637 μs .. 9.409 μs)
variance introduced by outliers: 11% (moderately inflated)

After:

benchmarked Data.ByteString.Builder/folds/foldl'/32768
time                 13.79 μs   (13.00 μs .. 14.59 μs)
                     0.987 R²   (0.980 R² .. 0.995 R²)
mean                 13.78 μs   (13.59 μs .. 14.00 μs)
std dev              710.7 ns   (599.7 ns .. 869.6 ns)
variance introduced by outliers: 31% (moderately inflated)

benchmarked Data.ByteString.Builder/folds/foldl'/65536
time                 26.19 μs   (25.51 μs .. 27.10 μs)
                     0.992 R²   (0.988 R² .. 0.995 R²)
mean                 27.38 μs   (26.95 μs .. 28.04 μs)
std dev              1.701 μs   (1.345 μs .. 2.364 μs)
variance introduced by outliers: 39% (moderately inflated)

This was done using ghc-8.10.2.

@Bodigrim
Contributor

Bodigrim commented Jan 13, 2021

I can certainly reproduce your benchmarks with roughly the same results.

Before:
folds/foldl'/1                           mean 9.508 ns  ( +- 202.4 ps  )
folds/foldl'/2                           mean 12.66 ns  ( +- 312.4 ps  )
folds/foldl'/4                           mean 18.88 ns  ( +- 608.7 ps  )
folds/foldl'/8                           mean 33.46 ns  ( +- 951.9 ps  )
folds/foldl'/16                          mean 62.70 ns  ( +- 2.429 ns  )
folds/foldl'/32                          mean 111.7 ns  ( +- 3.313 ns  )
folds/foldl'/64                          mean 213.3 ns  ( +- 6.860 ns  )
folds/foldl'/128                         mean 409.9 ns  ( +- 16.32 ns  )
folds/foldl'/256                         mean 814.9 ns  ( +- 27.73 ns  )
folds/foldl'/512                         mean 1.615 μs  ( +- 40.81 ns  )
folds/foldl'/1024                        mean 3.240 μs  ( +- 121.7 ns  )
folds/foldl'/2048                        mean 6.429 μs  ( +- 244.1 ns  )
folds/foldl'/4096                        mean 12.98 μs  ( +- 479.4 ns  )
folds/foldl'/8192                        mean 26.08 μs  ( +- 1.129 μs  )
folds/foldl'/16384                       mean 51.77 μs  ( +- 1.711 μs  )
folds/foldl'/32768                       mean 104.2 μs  ( +- 4.380 μs  )
folds/foldl'/65536                       mean 209.3 μs  ( +- 10.51 μs  )
After:
folds/foldl'/1                           mean 3.833 ns  ( +- 95.21 ps  )
folds/foldl'/2                           mean 4.195 ns  ( +- 177.2 ps  )
folds/foldl'/4                           mean 6.481 ns  ( +- 161.2 ps  )
folds/foldl'/8                           mean 8.896 ns  ( +- 196.2 ps  )
folds/foldl'/16                          mean 13.77 ns  ( +- 390.6 ps  )
folds/foldl'/32                          mean 27.55 ns  ( +- 476.3 ps  )
folds/foldl'/64                          mean 41.14 ns  ( +- 684.2 ps  )
folds/foldl'/128                         mean 69.15 ns  ( +- 1.425 ns  )
folds/foldl'/256                         mean 124.3 ns  ( +- 2.369 ns  )
folds/foldl'/512                         mean 233.5 ns  ( +- 4.240 ns  )
folds/foldl'/1024                        mean 452.1 ns  ( +- 7.693 ns  )
folds/foldl'/2048                        mean 888.6 ns  ( +- 15.60 ns  )
folds/foldl'/4096                        mean 1.790 μs  ( +- 41.93 ns  )
folds/foldl'/8192                        mean 3.500 μs  ( +- 45.58 ns  )
folds/foldl'/16384                       mean 7.078 μs  ( +- 169.4 ns  )
folds/foldl'/32768                       mean 13.97 μs  ( +- 176.2 ns  )
folds/foldl'/65536                       mean 28.01 μs  ( +- 465.3 ns  )

Contributor

@Bodigrim Bodigrim left a comment

Overall looks good. Please add comments (not haddock ones) explaining reasons behind these lambdas + update .hlint.yaml.

@Boarders
Contributor Author

@Bodigrim I added comments, updated hlint, and applied the same transformation to similar functions. Every benchmark improves quite a bit as functions are actually getting inlined now.

@Bodigrim
Contributor

> Every benchmark improves quite a bit as functions are actually getting inlined now.

Great! Could you please quote results, at least for the functions with pre-existing benchmarks?

@Boarders
Contributor Author

Boarders commented Jan 14, 2021

Here are the relevant benchmarks (just "folds" in BenchAll):

Old results:
folds/foldl'/1                           mean 12.79 ns  ( +- 1.327 ns  )
folds/foldl'/2                           mean 17.42 ns  ( +- 1.614 ns  )
folds/foldl'/4                           mean 24.47 ns  ( +- 2.322 ns  )
folds/foldl'/8                           mean 43.27 ns  ( +- 7.573 ns  )
folds/foldl'/16                          mean 76.70 ns  ( +- 7.063 ns  )
folds/foldl'/32                          mean 144.7 ns  ( +- 12.22 ns  )
folds/foldl'/64                          mean 256.3 ns  ( +- 12.97 ns  )
folds/foldl'/128                         mean 510.5 ns  ( +- 34.53 ns  )
folds/foldl'/256                         mean 975.2 ns  ( +- 102.5 ns  )
folds/foldl'/512                         mean 2.055 μs  ( +- 137.1 ns  )
folds/foldl'/1024                        mean 3.821 μs  ( +- 248.7 ns  )
folds/foldl'/2048                        mean 7.326 μs  ( +- 226.5 ns  )
folds/foldl'/4096                        mean 14.73 μs  ( +- 544.0 ns  )
folds/foldl'/8192                        mean 31.00 μs  ( +- 1.704 μs  )
folds/foldl'/16384                       mean 59.82 μs  ( +- 2.900 μs  )
folds/foldl'/32768                       mean 122.0 μs  ( +- 7.571 μs  )
folds/foldl'/65536                       mean 250.6 μs  ( +- 19.10 μs  )
folds/foldr'/1                           mean 11.77 ns  ( +- 612.5 ps  )
folds/foldr'/2                           mean 14.86 ns  ( +- 642.5 ps  )
folds/foldr'/4                           mean 21.93 ns  ( +- 868.3 ps  )
folds/foldr'/8                           mean 34.80 ns  ( +- 1.648 ns  )
folds/foldr'/16                          mean 61.46 ns  ( +- 2.352 ns  )
folds/foldr'/32                          mean 115.8 ns  ( +- 6.026 ns  )
folds/foldr'/64                          mean 245.0 ns  ( +- 20.18 ns  )
folds/foldr'/128                         mean 505.2 ns  ( +- 51.39 ns  )
folds/foldr'/256                         mean 1.008 μs  ( +- 119.4 ns  )
folds/foldr'/512                         mean 1.943 μs  ( +- 136.9 ns  )
folds/foldr'/1024                        mean 3.724 μs  ( +- 222.3 ns  )
folds/foldr'/2048                        mean 7.136 μs  ( +- 452.6 ns  )
folds/foldr'/4096                        mean 14.76 μs  ( +- 1.649 μs  )
folds/foldr'/8192                        mean 27.76 μs  ( +- 1.174 μs  )
folds/foldr'/16384                       mean 57.07 μs  ( +- 2.989 μs  )
folds/foldr'/32768                       mean 123.2 μs  ( +- 10.59 μs  )
folds/foldr'/65536                       mean 246.2 μs  ( +- 18.28 μs  )
folds/mapAccumL/1                        mean 24.37 ns  ( +- 3.095 ns  )
folds/mapAccumL/2                        mean 33.42 ns  ( +- 5.127 ns  )
folds/mapAccumL/4                        mean 44.39 ns  ( +- 7.768 ns  )
folds/mapAccumL/8                        mean 53.17 ns  ( +- 3.565 ns  )
folds/mapAccumL/16                       mean 88.96 ns  ( +- 5.184 ns  )
folds/mapAccumL/32                       mean 179.5 ns  ( +- 29.30 ns  )
folds/mapAccumL/64                       mean 337.2 ns  ( +- 35.81 ns  )
folds/mapAccumL/128                      mean 576.6 ns  ( +- 35.00 ns  )
folds/mapAccumL/256                      mean 1.161 μs  ( +- 83.48 ns  )
folds/mapAccumL/512                      mean 2.440 μs  ( +- 282.3 ns  )
folds/mapAccumL/1024                     mean 4.536 μs  ( +- 404.8 ns  )
folds/mapAccumL/2048                     mean 10.17 μs  ( +- 1.427 μs  )
folds/mapAccumL/4096                     mean 19.19 μs  ( +- 1.846 μs  )
folds/mapAccumL/8192                     mean 37.53 μs  ( +- 3.908 μs  )
folds/mapAccumL/16384                    mean 73.92 μs  ( +- 4.755 μs  )
folds/mapAccumL/32768                    mean 149.0 μs  ( +- 14.60 μs  )
folds/mapAccumL/65536                    mean 295.5 μs  ( +- 11.63 μs  )
folds/mapAccumR/1                        mean 24.62 ns  ( +- 3.710 ns  )
folds/mapAccumR/2                        mean 26.74 ns  ( +- 1.684 ns  )
folds/mapAccumR/4                        mean 35.89 ns  ( +- 1.836 ns  )
folds/mapAccumR/8                        mean 50.74 ns  ( +- 2.645 ns  )
folds/mapAccumR/16                       mean 84.15 ns  ( +- 4.996 ns  )
folds/mapAccumR/32                       mean 155.4 ns  ( +- 10.39 ns  )
folds/mapAccumR/64                       mean 280.4 ns  ( +- 8.924 ns  )
folds/mapAccumR/128                      mean 524.7 ns  ( +- 10.89 ns  )
folds/mapAccumR/256                      mean 1.023 μs  ( +- 17.79 ns  )
folds/mapAccumR/512                      mean 2.234 μs  ( +- 206.5 ns  )
folds/mapAccumR/1024                     mean 4.363 μs  ( +- 247.6 ns  )
folds/mapAccumR/2048                     mean 9.046 μs  ( +- 735.8 ns  )
folds/mapAccumR/4096                     mean 19.17 μs  ( +- 1.595 μs  )
folds/mapAccumR/8192                     mean 34.62 μs  ( +- 1.607 μs  )
folds/mapAccumR/16384                    mean 68.72 μs  ( +- 4.259 μs  )
folds/mapAccumR/32768                    mean 134.5 μs  ( +- 3.174 μs  )
folds/mapAccumR/65536                    mean 281.8 μs  ( +- 7.910 μs  )
folds/scanl/1                            mean 19.26 ns  ( +- 367.5 ps  )
folds/scanl/2                            mean 23.13 ns  ( +- 665.1 ps  )
folds/scanl/4                            mean 30.82 ns  ( +- 340.8 ps  )
folds/scanl/8                            mean 47.19 ns  ( +- 1.184 ns  )
folds/scanl/16                           mean 79.48 ns  ( +- 1.626 ns  )
folds/scanl/32                           mean 145.4 ns  ( +- 3.369 ns  )
folds/scanl/64                           mean 281.4 ns  ( +- 7.131 ns  )
folds/scanl/128                          mean 533.1 ns  ( +- 19.41 ns  )
folds/scanl/256                          mean 1.039 μs  ( +- 28.24 ns  )
folds/scanl/512                          mean 2.057 μs  ( +- 75.85 ns  )
folds/scanl/1024                         mean 4.059 μs  ( +- 52.30 ns  )
folds/scanl/2048                         mean 8.430 μs  ( +- 316.1 ns  )
folds/scanl/4096                         mean 16.92 μs  ( +- 708.5 ns  )
folds/scanl/8192                         mean 33.08 μs  ( +- 590.2 ns  )
folds/scanl/16384                        mean 65.84 μs  ( +- 936.3 ns  )
folds/scanl/32768                        mean 131.7 μs  ( +- 1.995 μs  )
folds/scanl/65536                        mean 276.9 μs  ( +- 4.149 μs  )
folds/scanr/1                            mean 19.47 ns  ( +- 583.1 ps  )
folds/scanr/2                            mean 23.72 ns  ( +- 267.8 ps  )
folds/scanr/4                            mean 31.33 ns  ( +- 659.7 ps  )
folds/scanr/8                            mean 46.47 ns  ( +- 4.092 ns  )
folds/scanr/16                           mean 76.44 ns  ( +- 726.1 ps  )
folds/scanr/32                           mean 138.6 ns  ( +- 2.647 ns  )
folds/scanr/64                           mean 265.6 ns  ( +- 5.044 ns  )
folds/scanr/128                          mean 507.5 ns  ( +- 11.02 ns  )
folds/scanr/256                          mean 1.014 μs  ( +- 26.98 ns  )
folds/scanr/512                          mean 2.018 μs  ( +- 46.40 ns  )
folds/scanr/1024                         mean 3.993 μs  ( +- 147.0 ns  )
folds/scanr/2048                         mean 7.942 μs  ( +- 226.8 ns  )
folds/scanr/4096                         mean 15.60 μs  ( +- 391.1 ns  )
folds/scanr/8192                         mean 31.55 μs  ( +- 1.547 μs  )
folds/scanr/16384                        mean 62.33 μs  ( +- 1.043 μs  )
folds/scanr/32768                        mean 124.2 μs  ( +- 3.972 μs  )
folds/scanr/65536                        mean 261.8 μs  ( +- 4.147 μs  )
folds/filter/1                           mean 27.35 ns  ( +- 1.019 ns  )
folds/filter/2                           mean 31.79 ns  ( +- 757.3 ps  )
folds/filter/4                           mean 40.62 ns  ( +- 2.607 ns  )
folds/filter/8                           mean 54.16 ns  ( +- 2.341 ns  )
folds/filter/16                          mean 84.25 ns  ( +- 3.366 ns  )
folds/filter/32                          mean 142.7 ns  ( +- 2.503 ns  )
folds/filter/64                          mean 271.5 ns  ( +- 6.071 ns  )
folds/filter/128                         mean 496.4 ns  ( +- 16.61 ns  )
folds/filter/256                         mean 948.1 ns  ( +- 12.83 ns  )
folds/filter/512                         mean 1.914 μs  ( +- 27.92 ns  )
folds/filter/1024                        mean 3.752 μs  ( +- 80.10 ns  )
folds/filter/2048                        mean 7.460 μs  ( +- 253.4 ns  )
folds/filter/4096                        mean 14.78 μs  ( +- 204.6 ns  )
folds/filter/8192                        mean 29.40 μs  ( +- 491.6 ns  )
folds/filter/16384                       mean 59.13 μs  ( +- 2.218 μs  )
folds/filter/32768                       mean 117.0 μs  ( +- 1.840 μs  )
folds/filter/65536                       mean 239.7 μs  ( +- 3.506 μs  )
New results:
folds/foldl'/1                           mean 5.270 ns  ( +- 261.4 ps  )
folds/foldl'/2                           mean 6.029 ns  ( +- 163.7 ps  )
folds/foldl'/4                           mean 7.648 ns  ( +- 311.3 ps  )
folds/foldl'/8                           mean 9.727 ns  ( +- 465.6 ps  )
folds/foldl'/16                          mean 14.25 ns  ( +- 1.132 ns  )
folds/foldl'/32                          mean 22.96 ns  ( +- 718.6 ps  )
folds/foldl'/64                          mean 40.71 ns  ( +- 928.4 ps  )
folds/foldl'/128                         mean 81.32 ns  ( +- 1.826 ns  )
folds/foldl'/256                         mean 150.4 ns  ( +- 1.411 ns  )
folds/foldl'/512                         mean 290.1 ns  ( +- 4.416 ns  )
folds/foldl'/1024                        mean 568.6 ns  ( +- 8.198 ns  )
folds/foldl'/2048                        mean 1.126 μs  ( +- 12.40 ns  )
folds/foldl'/4096                        mean 2.254 μs  ( +- 64.55 ns  )
folds/foldl'/8192                        mean 4.502 μs  ( +- 90.10 ns  )
folds/foldl'/16384                       mean 8.944 μs  ( +- 110.6 ns  )
folds/foldl'/32768                       mean 20.09 μs  ( +- 1.845 μs  )
folds/foldl'/65536                       mean 35.43 μs  ( +- 583.7 ns  )
folds/foldr'/1                           mean 4.402 ns  ( +- 184.5 ps  )
folds/foldr'/2                           mean 4.620 ns  ( +- 189.5 ps  )
folds/foldr'/4                           mean 5.092 ns  ( +- 94.68 ps  )
folds/foldr'/8                           mean 6.572 ns  ( +- 178.4 ps  )
folds/foldr'/16                          mean 9.765 ns  ( +- 363.7 ps  )
folds/foldr'/32                          mean 15.66 ns  ( +- 435.8 ps  )
folds/foldr'/64                          mean 26.39 ns  ( +- 330.7 ps  )
folds/foldr'/128                         mean 56.17 ns  ( +- 1.296 ns  )
folds/foldr'/256                         mean 100.2 ns  ( +- 1.249 ns  )
folds/foldr'/512                         mean 188.5 ns  ( +- 2.213 ns  )
folds/foldr'/1024                        mean 369.5 ns  ( +- 9.936 ns  )
folds/foldr'/2048                        mean 726.1 ns  ( +- 11.52 ns  )
folds/foldr'/4096                        mean 1.441 μs  ( +- 22.62 ns  )
folds/foldr'/8192                        mean 2.880 μs  ( +- 79.11 ns  )
folds/foldr'/16384                       mean 5.753 μs  ( +- 149.7 ns  )
folds/foldr'/32768                       mean 11.46 μs  ( +- 224.2 ns  )
folds/foldr'/65536                       mean 23.85 μs  ( +- 1.509 μs  )
folds/mapAccumL/1                        mean 15.31 ns  ( +- 703.8 ps  )
folds/mapAccumL/2                        mean 15.68 ns  ( +- 179.7 ps  )
folds/mapAccumL/4                        mean 18.14 ns  ( +- 1.385 ns  )
folds/mapAccumL/8                        mean 20.31 ns  ( +- 1.721 ns  )
folds/mapAccumL/16                       mean 23.91 ns  ( +- 1.486 ns  )
folds/mapAccumL/32                       mean 32.98 ns  ( +- 801.4 ps  )
folds/mapAccumL/64                       mean 53.83 ns  ( +- 4.928 ns  )
folds/mapAccumL/128                      mean 100.9 ns  ( +- 10.37 ns  )
folds/mapAccumL/256                      mean 172.0 ns  ( +- 8.031 ns  )
folds/mapAccumL/512                      mean 307.3 ns  ( +- 7.875 ns  )
folds/mapAccumL/1024                     mean 610.8 ns  ( +- 39.48 ns  )
folds/mapAccumL/2048                     mean 1.300 μs  ( +- 121.6 ns  )
folds/mapAccumL/4096                     mean 2.473 μs  ( +- 170.2 ns  )
folds/mapAccumL/8192                     mean 5.519 μs  ( +- 776.2 ns  )
folds/mapAccumL/16384                    mean 9.496 μs  ( +- 476.1 ns  )
folds/mapAccumL/32768                    mean 18.62 μs  ( +- 426.0 ns  )
folds/mapAccumL/65536                    mean 38.73 μs  ( +- 2.563 μs  )
folds/mapAccumR/1                        mean 17.08 ns  ( +- 2.783 ns  )
folds/mapAccumR/2                        mean 16.32 ns  ( +- 391.4 ps  )
folds/mapAccumR/4                        mean 17.77 ns  ( +- 1.125 ns  )
folds/mapAccumR/8                        mean 21.61 ns  ( +- 766.7 ps  )
folds/mapAccumR/16                       mean 30.00 ns  ( +- 350.5 ps  )
folds/mapAccumR/32                       mean 47.87 ns  ( +- 1.112 ns  )
folds/mapAccumR/64                       mean 82.98 ns  ( +- 1.198 ns  )
folds/mapAccumR/128                      mean 157.9 ns  ( +- 2.743 ns  )
folds/mapAccumR/256                      mean 298.5 ns  ( +- 6.054 ns  )
folds/mapAccumR/512                      mean 577.8 ns  ( +- 13.06 ns  )
folds/mapAccumR/1024                     mean 1.150 μs  ( +- 38.62 ns  )
folds/mapAccumR/2048                     mean 2.267 μs  ( +- 48.44 ns  )
folds/mapAccumR/4096                     mean 4.489 μs  ( +- 82.57 ns  )
folds/mapAccumR/8192                     mean 8.948 μs  ( +- 98.88 ns  )
folds/mapAccumR/16384                    mean 18.18 μs  ( +- 649.8 ns  )
folds/mapAccumR/32768                    mean 36.17 μs  ( +- 3.148 μs  )
folds/mapAccumR/65536                    mean 71.53 μs  ( +- 1.207 μs  )
folds/scanl/1                            mean 12.39 ns  ( +- 383.9 ps  )
folds/scanl/2                            mean 12.94 ns  ( +- 400.1 ps  )
folds/scanl/4                            mean 13.64 ns  ( +- 133.6 ps  )
folds/scanl/8                            mean 15.95 ns  ( +- 415.1 ps  )
folds/scanl/16                           mean 20.26 ns  ( +- 206.5 ps  )
folds/scanl/32                           mean 29.06 ns  ( +- 382.8 ps  )
folds/scanl/64                           mean 53.94 ns  ( +- 2.462 ns  )
folds/scanl/128                          mean 90.58 ns  ( +- 1.535 ns  )
folds/scanl/256                          mean 163.7 ns  ( +- 1.800 ns  )
folds/scanl/512                          mean 309.7 ns  ( +- 7.857 ns  )
folds/scanl/1024                         mean 607.5 ns  ( +- 16.33 ns  )
folds/scanl/2048                         mean 1.171 μs  ( +- 19.57 ns  )
folds/scanl/4096                         mean 2.282 μs  ( +- 19.19 ns  )
folds/scanl/8192                         mean 4.655 μs  ( +- 66.07 ns  )
folds/scanl/16384                        mean 9.477 μs  ( +- 130.7 ns  )
folds/scanl/32768                        mean 18.55 μs  ( +- 256.7 ns  )
folds/scanl/65536                        mean 37.64 μs  ( +- 433.1 ns  )
folds/scanr/1                            mean 12.84 ns  ( +- 394.0 ps  )
folds/scanr/2                            mean 13.23 ns  ( +- 216.6 ps  )
folds/scanr/4                            mean 14.27 ns  ( +- 152.0 ps  )
folds/scanr/8                            mean 16.92 ns  ( +- 335.7 ps  )
folds/scanr/16                           mean 21.30 ns  ( +- 701.1 ps  )
folds/scanr/32                           mean 29.99 ns  ( +- 1.620 ns  )
folds/scanr/64                           mean 47.68 ns  ( +- 840.1 ps  )
folds/scanr/128                          mean 88.02 ns  ( +- 1.001 ns  )
folds/scanr/256                          mean 159.3 ns  ( +- 1.947 ns  )
folds/scanr/512                          mean 301.2 ns  ( +- 5.620 ns  )
folds/scanr/1024                         mean 585.9 ns  ( +- 9.126 ns  )
folds/scanr/2048                         mean 1.154 μs  ( +- 13.38 ns  )
folds/scanr/4096                         mean 2.289 μs  ( +- 45.71 ns  )
folds/scanr/8192                         mean 4.546 μs  ( +- 125.8 ns  )
folds/scanr/16384                        mean 9.065 μs  ( +- 125.0 ns  )
folds/scanr/32768                        mean 17.91 μs  ( +- 219.2 ns  )
folds/scanr/65536                        mean 36.12 μs  ( +- 1.025 μs  )
folds/filter/1                           mean 21.27 ns  ( +- 460.9 ps  )
folds/filter/2                           mean 22.52 ns  ( +- 276.1 ps  )
folds/filter/4                           mean 24.51 ns  ( +- 386.5 ps  )
folds/filter/8                           mean 27.42 ns  ( +- 760.0 ps  )
folds/filter/16                          mean 32.30 ns  ( +- 811.0 ps  )
folds/filter/32                          mean 41.19 ns  ( +- 852.5 ps  )
folds/filter/64                          mean 64.63 ns  ( +- 783.9 ps  )
folds/filter/128                         mean 105.0 ns  ( +- 2.174 ns  )
folds/filter/256                         mean 178.7 ns  ( +- 3.316 ns  )
folds/filter/512                         mean 355.5 ns  ( +- 59.48 ns  )
folds/filter/1024                        mean 624.0 ns  ( +- 14.94 ns  )
folds/filter/2048                        mean 1.187 μs  ( +- 13.77 ns  )
folds/filter/4096                        mean 2.429 μs  ( +- 31.81 ns  )
folds/filter/8192                        mean 4.780 μs  ( +- 108.2 ns  )
folds/filter/16384                       mean 9.308 μs  ( +- 145.0 ns  )
folds/filter/32768                       mean 18.51 μs  ( +- 205.9 ns  )
folds/filter/65536                       mean 36.94 μs  ( +- 404.2 ns  )

@Boarders
Contributor Author

I added your other suggestions too @Bodigrim and have put it all into one commit.

@Bodigrim
Contributor

I do not quite understand why scan{l,r} and filter became so drastically faster. Other examples here (fold{l,r}{,'}, mapAccum{L,R}) are polymorphic, so inlining monomorphises definitions with understandable performance benefits. But scan{l,r} and filter are monomorphic already!

@Boarders
Contributor Author

My only guess is that once filter/scan get inlined, the predicate/accumulator can be inlined too, and the result can then be optimized much better in the stages of the pipeline after Core. I really don't know enough to say, though.

@Bodigrim
Contributor

Yeah, sounds plausible; inlining a predicate probably makes it possible to avoid boxing/unboxing of Word8.

@Bodigrim
Contributor

What I'm trying to grasp is how much of the speedup can be attributed purely to the deficiencies of the eta-reduced expressions used in the benchmarks.

It's probably correct that the most drastic impact is due not to inlining itself, but to the complete elimination of memory allocation. It would be nice to have a robust way to exploit this without relying on fragile pragmas. Is this evidence that we should expose a set of functions taking callbacks of the form Word8# -> Bool, Word8# -> Word8# -> Word8#, etc.?

@ethercrow
Contributor

ethercrow commented Jan 15, 2021

> taking callbacks of form Word8# -> Bool, Word8# -> Word8# -> Word8#

These two callbacks can be encoded as a 256-bit mask and a uint8_t[256*256] lookup table, respectively, and some folds could then be implemented entirely in C.
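
To make the first encoding concrete, here is a minimal sketch of a 256-bit mask for a byte predicate, written in Haskell rather than C and purely for illustration. The type `ByteMask` and the functions `maskFromPredicate`, `memberMask` and `filterWithMask` are hypothetical names, not part of bytestring.

```haskell
module Main (main) where

import Data.Bits (setBit, shiftR, testBit, (.&.))
import qualified Data.ByteString as B
import Data.Word (Word64, Word8)

-- | A set of bytes stored as 256 bits: four 64-bit words (hypothetical type).
data ByteMask = ByteMask !Word64 !Word64 !Word64 !Word64

-- | Tabulate a predicate once over all 256 possible byte values.
maskFromPredicate :: (Word8 -> Bool) -> ByteMask
maskFromPredicate p = ByteMask (chunk 0) (chunk 64) (chunk 128) (chunk 192)
  where
    chunk :: Int -> Word64
    chunk base = foldl set 0 [0 .. 63]
      where
        set acc i
          | p (fromIntegral (base + i)) = setBit acc i
          | otherwise                   = acc

-- | Membership test: the top two bits pick the word, the low six pick the bit.
memberMask :: ByteMask -> Word8 -> Bool
memberMask (ByteMask a b c d) w =
  case w `shiftR` 6 of
    0 -> testBit a i
    1 -> testBit b i
    2 -> testBit c i
    _ -> testBit d i
  where
    i = fromIntegral (w .&. 63) :: Int

-- | Filtering via the mask instead of calling the original predicate per byte.
filterWithMask :: ByteMask -> B.ByteString -> B.ByteString
filterWithMask m = B.filter (memberMask m)

main :: IO ()
main = do
  let m = maskFromPredicate even
  print (B.unpack (filterWithMask m (B.pack [1, 2, 3, 4, 250, 255])))
  -- prints [2,4,250]
```

The point of the encoding is that the mask is plain data, so a loop written in C could consume it without calling back into Haskell. The sketch above only shows the encoding, not such a C loop.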

@Bodigrim
Contributor

@ethercrow yeah, interesting idea. One could have a Template Haskell routine, precomputing masks/lookup tables during compilation and generating the most efficient code for each particular case.

@Boarders could you please check, whether replacing INLINE by INLINEABLE is sufficient for all functions in this PR?
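
For readers unfamiliar with the two pragmas, the experiment amounts to swapping one for the other on otherwise identical definitions. The sketch below uses hypothetical functions `countInline` and `countInlinable` built on `B.foldl'`; it is not code from this PR. Per the GHC User's Guide, INLINE asks GHC to unfold the function at saturated call sites, whereas INLINABLE keeps the unfolding available but leaves the decision to GHC's usual heuristics.

```haskell
module Main (main) where

import qualified Data.ByteString as B
import Data.Word (Word8)

-- INLINE: GHC is asked to unfold this definition at every call site that
-- supplies the arguments written left of '=' (here, just the predicate).
countInline :: (Word8 -> Bool) -> B.ByteString -> Int
countInline p = \bs -> B.foldl' (\n w -> if p w then n + 1 else n) 0 bs
{-# INLINE countInline #-}

-- INLINABLE: the unfolding is kept available to callers, but whether it is
-- actually inlined is decided by GHC's normal heuristics at each call site.
countInlinable :: (Word8 -> Bool) -> B.ByteString -> Int
countInlinable p = \bs -> B.foldl' (\n w -> if p w then n + 1 else n) 0 bs
{-# INLINABLE countInlinable #-}

main :: IO ()
main = do
  let bs = B.pack [0 .. 9]
  print (countInline even bs, countInlinable even bs)  -- prints (5,5)
```

The two functions always return the same value; any difference shows up only in the generated code and therefore in benchmark timings.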

@Boarders
Contributor Author

Boarders commented Jan 15, 2021

@Bodigrim: These are the results if we just replace INLINE by INLINABLE:
folds/foldl'/1                           mean 5.787 ns  ( +- 380.2 ps  )
folds/foldl'/2                           mean 7.098 ns  ( +- 1.000 ns  )
folds/foldl'/4                           mean 8.625 ns  ( +- 1.562 ns  )
folds/foldl'/8                           mean 10.71 ns  ( +- 1.260 ns  )
folds/foldl'/16                          mean 15.21 ns  ( +- 1.051 ns  )
folds/foldl'/32                          mean 23.44 ns  ( +- 896.9 ps  )
folds/foldl'/64                          mean 42.02 ns  ( +- 3.887 ns  )
folds/foldl'/128                         mean 83.60 ns  ( +- 3.602 ns  )
folds/foldl'/256                         mean 152.5 ns  ( +- 6.105 ns  )
folds/foldl'/512                         mean 295.7 ns  ( +- 20.20 ns  )
folds/foldl'/1024                        mean 576.7 ns  ( +- 31.48 ns  )
folds/foldl'/2048                        mean 1.134 μs  ( +- 26.00 ns  )
folds/foldl'/4096                        mean 2.242 μs  ( +- 44.67 ns  )
folds/foldl'/8192                        mean 4.487 μs  ( +- 91.94 ns  )
folds/foldl'/16384                       mean 9.210 μs  ( +- 276.8 ns  )
folds/foldl'/32768                       mean 19.07 μs  ( +- 1.515 μs  )
folds/foldl'/65536                       mean 36.44 μs  ( +- 1.154 μs  )
folds/foldr'/1                           mean 4.344 ns  ( +- 174.4 ps  )
folds/foldr'/2                           mean 5.000 ns  ( +- 373.0 ps  )
folds/foldr'/4                           mean 5.565 ns  ( +- 318.3 ps  )
folds/foldr'/8                           mean 7.440 ns  ( +- 816.6 ps  )
folds/foldr'/16                          mean 10.61 ns  ( +- 358.6 ps  )
folds/foldr'/32                          mean 16.63 ns  ( +- 769.3 ps  )
folds/foldr'/64                          mean 28.06 ns  ( +- 905.8 ps  )
folds/foldr'/128                         mean 58.35 ns  ( +- 2.624 ns  )
folds/foldr'/256                         mean 103.4 ns  ( +- 4.358 ns  )
folds/foldr'/512                         mean 196.0 ns  ( +- 5.829 ns  )
folds/foldr'/1024                        mean 379.8 ns  ( +- 11.23 ns  )
folds/foldr'/2048                        mean 753.8 ns  ( +- 27.88 ns  )
folds/foldr'/4096                        mean 1.456 μs  ( +- 27.78 ns  )
folds/foldr'/8192                        mean 2.935 μs  ( +- 98.35 ns  )
folds/foldr'/16384                       mean 5.925 μs  ( +- 216.7 ns  )
folds/foldr'/32768                       mean 11.72 μs  ( +- 301.1 ns  )
folds/foldr'/65536                       mean 23.35 μs  ( +- 811.0 ns  )
folds/mapAccumL/1                        mean 33.05 ns  ( +- 2.362 ns  )
folds/mapAccumL/2                        mean 49.55 ns  ( +- 2.798 ns  )
folds/mapAccumL/4                        mean 83.91 ns  ( +- 3.287 ns  )
folds/mapAccumL/8                        mean 147.0 ns  ( +- 4.765 ns  )
folds/mapAccumL/16                       mean 292.8 ns  ( +- 25.99 ns  )
folds/mapAccumL/32                       mean 594.5 ns  ( +- 64.55 ns  )
folds/mapAccumL/64                       mean 1.081 μs  ( +- 27.89 ns  )
folds/mapAccumL/128                      mean 2.197 μs  ( +- 177.4 ns  )
folds/mapAccumL/256                      mean 4.066 μs  ( +- 130.9 ns  )
folds/mapAccumL/512                      mean 8.392 μs  ( +- 774.9 ns  )
folds/mapAccumL/1024                     mean 17.41 μs  ( +- 1.124 μs  )
folds/mapAccumL/2048                     mean 34.31 μs  ( +- 1.375 μs  )
folds/mapAccumL/4096                     mean 65.06 μs  ( +- 3.101 μs  )
folds/mapAccumL/8192                     mean 131.7 μs  ( +- 10.68 μs  )
folds/mapAccumL/16384                    mean 270.7 μs  ( +- 5.441 μs  )
folds/mapAccumL/32768                    mean 541.6 μs  ( +- 21.66 μs  )
folds/mapAccumL/65536                    mean 1.077 ms  ( +- 100.4 μs  )
folds/mapAccumR/1                        mean 32.48 ns  ( +- 1.468 ns  )
folds/mapAccumR/2                        mean 48.87 ns  ( +- 4.247 ns  )
folds/mapAccumR/4                        mean 79.49 ns  ( +- 5.679 ns  )
folds/mapAccumR/8                        mean 146.0 ns  ( +- 9.599 ns  )
folds/mapAccumR/16                       mean 268.6 ns  ( +- 7.887 ns  )
folds/mapAccumR/32                       mean 512.6 ns  ( +- 27.86 ns  )
folds/mapAccumR/64                       mean 978.5 ns  ( +- 29.74 ns  )
folds/mapAccumR/128                      mean 1.956 μs  ( +- 101.3 ns  )
folds/mapAccumR/256                      mean 3.845 μs  ( +- 106.3 ns  )
folds/mapAccumR/512                      mean 7.900 μs  ( +- 587.6 ns  )
folds/mapAccumR/1024                     mean 16.18 μs  ( +- 865.5 ns  )
folds/mapAccumR/2048                     mean 31.95 μs  ( +- 4.890 μs  )
folds/mapAccumR/4096                     mean 64.77 μs  ( +- 3.241 μs  )
folds/mapAccumR/8192                     mean 151.1 μs  ( +- 22.03 μs  )
folds/mapAccumR/16384                    mean 321.9 μs  ( +- 60.21 μs  )
folds/mapAccumR/32768                    mean 553.8 μs  ( +- 44.48 μs  )
folds/mapAccumR/65536                    mean 1.327 ms  ( +- 163.8 μs  )
folds/scanl/1                            mean 21.73 ns  ( +- 2.015 ns  )
folds/scanl/2                            mean 27.63 ns  ( +- 1.790 ns  )
folds/scanl/4                            mean 39.25 ns  ( +- 1.072 ns  )
folds/scanl/8                            mean 68.94 ns  ( +- 8.732 ns  )
folds/scanl/16                           mean 119.5 ns  ( +- 8.361 ns  )
folds/scanl/32                           mean 249.8 ns  ( +- 26.17 ns  )
folds/scanl/64                           mean 446.0 ns  ( +- 37.09 ns  )
folds/scanl/128                          mean 825.6 ns  ( +- 39.39 ns  )
folds/scanl/256                          mean 1.694 μs  ( +- 101.7 ns  )
folds/scanl/512                          mean 3.464 μs  ( +- 407.1 ns  )
folds/scanl/1024                         mean 6.772 μs  ( +- 660.9 ns  )
folds/scanl/2048                         mean 12.92 μs  ( +- 691.6 ns  )
folds/scanl/4096                         mean 25.95 μs  ( +- 2.115 μs  )
folds/scanl/8192                         mean 49.66 μs  ( +- 2.079 μs  )
folds/scanl/16384                        mean 101.1 μs  ( +- 4.548 μs  )
folds/scanl/32768                        mean 208.1 μs  ( +- 9.402 μs  )
folds/scanl/65536                        mean 425.4 μs  ( +- 15.08 μs  )
folds/scanr/1                            mean 19.91 ns  ( +- 711.9 ps  )
folds/scanr/2                            mean 26.36 ns  ( +- 1.234 ns  )
folds/scanr/4                            mean 38.25 ns  ( +- 1.568 ns  )
folds/scanr/8                            mean 61.90 ns  ( +- 1.102 ns  )
folds/scanr/16                           mean 112.9 ns  ( +- 3.190 ns  )
folds/scanr/32                           mean 219.7 ns  ( +- 9.814 ns  )
folds/scanr/64                           mean 420.8 ns  ( +- 20.91 ns  )
folds/scanr/128                          mean 913.0 ns  ( +- 79.22 ns  )
folds/scanr/256                          mean 1.735 μs  ( +- 106.9 ns  )
folds/scanr/512                          mean 3.312 μs  ( +- 175.8 ns  )
folds/scanr/1024                         mean 6.487 μs  ( +- 422.6 ns  )
folds/scanr/2048                         mean 12.90 μs  ( +- 318.4 ns  )
folds/scanr/4096                         mean 25.65 μs  ( +- 880.0 ns  )
folds/scanr/8192                         mean 50.50 μs  ( +- 3.794 μs  )
folds/scanr/16384                        mean 100.1 μs  ( +- 2.196 μs  )
folds/scanr/32768                        mean 206.7 μs  ( +- 9.113 μs  )
folds/scanr/65536                        mean 410.2 μs  ( +- 15.70 μs  )
folds/filter/1                           mean 28.61 ns  ( +- 940.9 ps  )
folds/filter/2                           mean 33.83 ns  ( +- 1.747 ns  )
folds/filter/4                           mean 44.98 ns  ( +- 3.248 ns  )
folds/filter/8                           mean 66.40 ns  ( +- 4.899 ns  )
folds/filter/16                          mean 99.45 ns  ( +- 6.722 ns  )
folds/filter/32                          mean 178.5 ns  ( +- 4.149 ns  )
folds/filter/64                          mean 330.8 ns  ( +- 16.03 ns  )
folds/filter/128                         mean 616.5 ns  ( +- 9.632 ns  )
folds/filter/256                         mean 1.203 μs  ( +- 25.84 ns  )
folds/filter/512                         mean 2.404 μs  ( +- 33.80 ns  )
folds/filter/1024                        mean 4.799 μs  ( +- 180.7 ns  )
folds/filter/2048                        mean 9.456 μs  ( +- 315.2 ns  )
folds/filter/4096                        mean 18.99 μs  ( +- 496.4 ns  )
folds/filter/8192                        mean 37.54 μs  ( +- 610.1 ns  )
folds/filter/16384                       mean 74.99 μs  ( +- 1.257 μs  )
folds/filter/32768                       mean 151.1 μs  ( +- 13.75 μs  )
folds/filter/65536                       mean 297.9 μs  ( +- 5.205 μs  )

In particular, it looks like it works for foldl' and foldr' but for nothing else. I wish I knew a better reason why that is the case.

@Bodigrim
Contributor

OK, let's deal with folds first. Hopefully, it will provide us with some insight.

To summarise what was discussed above and in tickets #23 and #329, we have definitions like

foldl' f z bs = <...>
{-# INLINE foldl' #-}

and benchmarks of the form

benchFoldl' = foldl' (+) 0

This synthetic benchmark does not perform well, because foldl' is not saturated and thus does not get inlined into benchFoldl'. One can fix this by rewriting the definition as foldl' f z = \bs -> <...>, which is what this PR does. But also, quite surprisingly, changing INLINE to INLINABLE or even removing the pragma improves performance as well, actually causing foldl' to inline (even though its application is not saturated). What's going on?
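For illustration, here is a minimal sketch of the two definition shapes. It uses plain lists instead of the real ByteString internals, and the names foldlThreeArgs, foldlTwoArgs, sumThreeArgs and sumTwoArgs are hypothetical stand-ins, not code from this PR. It relies on the documented GHC rule that an INLINE function is only unfolded when it is applied to as many arguments as appear syntactically on the left-hand side of its definition.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Word (Word8)

-- Hypothetical stand-in for the old shape: three arguments on the left
-- of '=', so the INLINE unfolding is only used at call sites that
-- supply all three arguments.
foldlThreeArgs :: (a -> Word8 -> a) -> a -> [Word8] -> a
foldlThreeArgs f z bs = go z bs
  where
    go !acc []       = acc
    go !acc (x : xs) = go (f acc x) xs
{-# INLINE foldlThreeArgs #-}

-- Hypothetical stand-in for the new shape: two arguments on the left of
-- '=', the third bound by a lambda, so 'foldlTwoArgs f z' already
-- counts as a saturated application.
foldlTwoArgs :: (a -> Word8 -> a) -> a -> [Word8] -> a
foldlTwoArgs f z = \bs -> go z bs
  where
    go !acc []       = acc
    go !acc (x : xs) = go (f acc x) xs
{-# INLINE foldlTwoArgs #-}

-- Point-free callers shaped like benchFoldl': each passes only two
-- arguments to the fold.
sumThreeArgs :: [Word8] -> Int
sumThreeArgs = foldlThreeArgs (\acc w -> acc + fromIntegral w) 0

sumTwoArgs :: [Word8] -> Int
sumTwoArgs = foldlTwoArgs (\acc w -> acc + fromIntegral w) 0
```

Both callers compute the same result; the only intended difference is whether the two-argument application is eligible for the INLINE unfolding. Whether GHC actually inlines in a given version is exactly what this thread is trying to pin down, so treat the sketch as a description of the rule, not a performance guarantee.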

CC @hsyl20 @treeowl

@Bodigrim
Contributor

Bodigrim commented Feb 5, 2021

Curiouser and curiouser! In GHC 9.0 performance gains for foldl' / foldr' disappear (benchmarks are as slow as in master), but other functions are still 70-80% faster.

@sjakobi
Member

sjakobi commented Feb 6, 2021

In GHC 9.0 performance gains for foldl' / foldr' disappear (benchmarks are as slow as in master)

If they've become slower than they were with GHC 8.10, that would probably be worth reporting to the GHC devs.

@Bodigrim
Contributor

Bodigrim commented Feb 6, 2021

I assume fold{l,r}' do not get inlined any longer, even after changes in this branch. It would be nice to investigate thoroughly, but I'm short of resources at the moment.
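One way to check this assumption, sketched here under the assumption that a bytestring build is visible to GHC, is to compile a tiny point-free caller with Core dumping enabled and read the output. The function name sumBytes is hypothetical; the flags are standard GHC debugging flags.

```haskell
-- Dump optimised Core to a .dump-simpl file next to the source,
-- with most annotations suppressed for readability.
{-# OPTIONS_GHC -O2 -ddump-simpl -dsuppress-all -dsuppress-uniques -ddump-to-file #-}

import qualified Data.ByteString as B

-- Hypothetical point-free caller, shaped like the benchmark in this
-- thread: foldl' receives only two of its three arguments here.
sumBytes :: B.ByteString -> Int
sumBytes = B.foldl' (\acc w -> acc + fromIntegral w) 0
```

If the fold was inlined, the Core for sumBytes would be expected to contain a local loop over the buffer; if it was not, it would be expected to contain a call to foldl' with the lambda passed as a closure. Comparing the dumps from GHC 8.10 and 9.0 should show which case applies.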

@hsyl20
Contributor

hsyl20 commented Feb 8, 2021

I haven't looked in detail, but it could be related to https://gitlab.haskell.org/ghc/ghc/-/issues/18993

@Bodigrim Bodigrim mentioned this pull request Feb 20, 2021
6 tasks
@Boarders
Contributor Author

The same issue has been reported on the GHC issue tracker for foldl' on lists:
https://gitlab.haskell.org/ghc/ghc/-/issues/19534

@Bodigrim
Contributor

Let's merge this.

The bottom line is that this patch increases inlining opportunities, is the most conservative way to do so, and does not make anything worse. Benchmarks are clearly in favor of it. And base is changing its definitions in the same way.

While it would be interesting to understand why {-# INLINABLE #-} or removing the pragma altogether tricks GHC 8.10 into inlining even without the lambda introduction, this is not a sustainable solution. Inlining is fragile enough, and I'd rather follow the general recommendation to ensure that applications are saturated and put {-# INLINE #-}, as done in this PR.

When this is merged, we can raise a question about the GHC 9.0 regression in the GHC issue tracker. There is a bunch of inlining regressions in 9.0.1; I wonder if 9.0.2 will be better.

@Bodigrim Bodigrim requested a review from sjakobi May 16, 2021 13:58
@Boarders
Contributor Author

@Bodigrim When I next get a free cycle (which shouldn't be too long) I will experiment with your unboxed callback idea and @ethercrow's suggestion to see what they do for the benchmarks. For the latter, I will merely say whether the improvement is particularly good using some naive approach and then make an issue where it can be decided how best to utilise that technique.

@Bodigrim Bodigrim merged commit cdf6ebb into haskell:master May 19, 2021
@Bodigrim
Contributor

Thanks, @Boarders!

@Bodigrim Bodigrim added this to the 0.11.2.0 milestone May 19, 2021
Bodigrim pushed a commit to Bodigrim/bytestring that referenced this pull request May 22, 2021
noughtmare pushed a commit to noughtmare/bytestring that referenced this pull request Dec 12, 2021
Successfully merging this pull request may close these issues.

foldl' doens't inline on basic example
Performance over Data.Vector