Remove INLINE pragma for default implementations of Store methods #94

Merged
merged 1 commit into from Feb 14, 2017

RyanGlScott
Contributor

As discussed in #91 (and in https://ghc.haskell.org/trac/ghc/ticket/13059), compiling store takes far more memory with GHC 8.0.2 than with 8.0.1: maximum residency jumps from 1.6 GB to 5.17 GB. I finally nailed down why this appears only in GHC 8.0.2 in https://ghc.haskell.org/trac/ghc/ticket/13059#comment:20. I encourage you to read that if you are curious, but the tl;dr version is that GHC 8.0.1 and earlier incorrectly dropped INLINE pragmas on default method implementations. When 8.0.2 fixed this bug, the amount of inlining that happens in store increased dramatically, causing the memory spike.

We can at least go back to the "status quo" of GHC 8.0.1 and earlier by removing the INLINE pragmas for the default implementations of Store's class methods, which restores the maximum residency to about 1.6 GB for both GHC 8.0.1 and 8.0.2.
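For readers unfamiliar with the pattern, here is a minimal sketch of what "an INLINE pragma on a default method implementation" looks like. The `Encode` class, its `encode` method, and the `Point` type are hypothetical and made up for illustration; this is not store's actual `Store` class or the actual diff in this PR.

```haskell
{-# LANGUAGE DefaultSignatures #-}
module Main (main) where

-- Hypothetical toy class, standing in for a class with default methods.
class Encode a where
  encode :: a -> [Int]
  default encode :: Show a => a -> [Int]
  encode = map fromEnum . show
  -- A pragma attached to the default implementation would sit here:
  --   {-# INLINE encode #-}
  -- With it, a compiler that honours the pragma inlines the default body
  -- into every instance that relies on the default. Without it, the
  -- decision is left to the ordinary inlining heuristics.

data Point = Point Int Int deriving Show

-- Relies on the default implementation of encode.
instance Encode Point

-- An instance that defines the method itself can still carry its own pragma.
instance Encode Bool where
  encode b = [fromEnum b]
  {-# INLINE encode #-}

main :: IO ()
main = do
  print (encode (Point 1 2))
  print (encode True)
```

The behaviour of the program is the same with or without the pragma; only the amount of code the compiler has to duplicate and optimise changes.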

Of course, you might be concerned that this could impact performance - just in case, I ran the store benchmarks with and without this change on a 4-core, 64-bit Linux desktop with 16 GB of RAM.

Here are the results before this PR:

benchmarking encode/ (Vector Int)
time                 1.388 μs   (1.383 μs .. 1.396 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.384 μs   (1.383 μs .. 1.388 μs)
std dev              6.055 ns   (1.824 ns .. 12.54 ns)

benchmarking encode/1kb storable (Vector Int32)
time                 92.25 ns   (92.15 ns .. 92.39 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 92.39 ns   (92.28 ns .. 92.50 ns)
std dev              370.9 ps   (308.3 ps .. 463.2 ps)

benchmarking encode/10kb storable (Vector Int32)
time                 567.2 ns   (566.1 ns .. 568.3 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 565.9 ns   (565.1 ns .. 566.9 ns)
std dev              2.928 ns   (2.232 ns .. 3.901 ns)

benchmarking encode/1kb normal (Vector Int32)
time                 3.461 μs   (3.458 μs .. 3.465 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.466 μs   (3.462 μs .. 3.475 μs)
std dev              18.32 ns   (10.97 ns .. 30.80 ns)

benchmarking encode/10kb normal (Vector Int32)
time                 34.42 μs   (34.40 μs .. 34.45 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 34.47 μs   (34.42 μs .. 34.56 μs)
std dev              218.5 ns   (118.0 ns .. 335.5 ns)

benchmarking encode/ (Vector SmallProduct)
time                 1.849 μs   (1.846 μs .. 1.851 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.847 μs   (1.845 μs .. 1.851 μs)
std dev              8.251 ns   (3.478 ns .. 14.62 ns)

benchmarking encode/ (Vector SmallProductManual)
time                 1.845 μs   (1.843 μs .. 1.847 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.853 μs   (1.847 μs .. 1.859 μs)
std dev              19.57 ns   (14.50 ns .. 23.77 ns)

benchmarking encode/ (Vector SmallSum)
time                 2.504 μs   (2.496 μs .. 2.512 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.500 μs   (2.494 μs .. 2.506 μs)
std dev              20.83 ns   (17.81 ns .. 24.78 ns)

benchmarking encode/ (Vector SmallSumManual)
time                 2.746 μs   (2.724 μs .. 2.767 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 2.741 μs   (2.720 μs .. 2.762 μs)
std dev              69.60 ns   (59.34 ns .. 86.78 ns)
variance introduced by outliers: 31% (moderately inflated)

benchmarking encode/ (Vector ((Int,Int),(Int,Int)))
time                 2.000 μs   (1.999 μs .. 2.000 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.000 μs   (2.000 μs .. 2.001 μs)
std dev              2.202 ns   (1.610 ns .. 3.227 ns)

benchmarking encode/ (Vector SomeData)
time                 1.488 μs   (1.488 μs .. 1.489 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.489 μs   (1.488 μs .. 1.490 μs)
std dev              2.625 ns   (1.676 ns .. 4.248 ns)

benchmarking decode/ (Vector Int)
time                 712.2 ns   (711.9 ns .. 712.6 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 712.8 ns   (712.1 ns .. 714.6 ns)
std dev              3.532 ns   (1.881 ns .. 6.675 ns)

benchmarking decode/1kb storable (Vector Int32)
time                 85.96 ns   (85.88 ns .. 86.06 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 86.23 ns   (86.04 ns .. 86.51 ns)
std dev              777.2 ps   (587.1 ps .. 966.2 ps)

benchmarking decode/10kb storable (Vector Int32)
time                 521.2 ns   (518.6 ns .. 523.5 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 521.9 ns   (520.9 ns .. 522.4 ns)
std dev              2.279 ns   (1.124 ns .. 4.624 ns)

benchmarking decode/1kb normal (Vector Int32)
time                 1.769 μs   (1.768 μs .. 1.769 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.769 μs   (1.767 μs .. 1.769 μs)
std dev              3.520 ns   (2.769 ns .. 4.630 ns)

benchmarking decode/10kb normal (Vector Int32)
time                 17.60 μs   (17.49 μs .. 17.77 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 17.77 μs   (17.66 μs .. 17.89 μs)
std dev              391.7 ns   (324.7 ns .. 480.3 ns)
variance introduced by outliers: 21% (moderately inflated)

benchmarking decode/ (Vector SmallProduct)
time                 1.496 μs   (1.495 μs .. 1.498 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.506 μs   (1.499 μs .. 1.516 μs)
std dev              26.43 ns   (19.35 ns .. 34.73 ns)
variance introduced by outliers: 19% (moderately inflated)

benchmarking decode/ (Vector SmallProductManual)
time                 1.869 μs   (1.868 μs .. 1.871 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.869 μs   (1.868 μs .. 1.870 μs)
std dev              2.737 ns   (1.511 ns .. 4.495 ns)

benchmarking decode/ (Vector SmallSum)
time                 1.813 μs   (1.812 μs .. 1.816 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.815 μs   (1.813 μs .. 1.820 μs)
std dev              11.53 ns   (4.624 ns .. 18.85 ns)

benchmarking decode/ (Vector SmallSumManual)
time                 1.214 μs   (1.213 μs .. 1.215 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.215 μs   (1.214 μs .. 1.216 μs)
std dev              4.002 ns   (2.856 ns .. 6.074 ns)

benchmarking decode/ (Vector ((Int,Int),(Int,Int)))
time                 1.703 μs   (1.700 μs .. 1.707 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.709 μs   (1.706 μs .. 1.712 μs)
std dev              8.555 ns   (6.162 ns .. 12.57 ns)

benchmarking decode/ (Vector SomeData)
time                 850.8 ns   (849.5 ns .. 852.4 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 849.8 ns   (849.2 ns .. 850.6 ns)
std dev              2.375 ns   (1.628 ns .. 3.618 ns)

Here are the results after this PR:

benchmarking encode/ (Vector Int)
time                 1.375 μs   (1.375 μs .. 1.376 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.376 μs   (1.376 μs .. 1.377 μs)
std dev              1.585 ns   (1.316 ns .. 1.945 ns)

benchmarking encode/1kb storable (Vector Int32)
time                 91.61 ns   (91.53 ns .. 91.67 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 91.15 ns   (90.92 ns .. 91.35 ns)
std dev              730.1 ps   (641.9 ps .. 810.1 ps)

benchmarking encode/10kb storable (Vector Int32)
time                 545.8 ns   (544.5 ns .. 547.5 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 547.1 ns   (546.3 ns .. 547.6 ns)
std dev              1.960 ns   (1.295 ns .. 2.666 ns)

benchmarking encode/1kb normal (Vector Int32)
time                 3.445 μs   (3.444 μs .. 3.446 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.448 μs   (3.446 μs .. 3.450 μs)
std dev              5.749 ns   (3.655 ns .. 9.812 ns)

benchmarking encode/10kb normal (Vector Int32)
time                 34.30 μs   (34.28 μs .. 34.32 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 34.31 μs   (34.30 μs .. 34.33 μs)
std dev              50.09 ns   (34.26 ns .. 72.53 ns)

benchmarking encode/ (Vector SmallProduct)
time                 1.721 μs   (1.713 μs .. 1.731 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.715 μs   (1.712 μs .. 1.721 μs)
std dev              13.85 ns   (8.962 ns .. 18.44 ns)

benchmarking encode/ (Vector SmallProductManual)
time                 1.814 μs   (1.812 μs .. 1.817 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.814 μs   (1.812 μs .. 1.818 μs)
std dev              9.385 ns   (4.859 ns .. 17.05 ns)

benchmarking encode/ (Vector SmallSum)
time                 2.427 μs   (2.424 μs .. 2.430 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.429 μs   (2.426 μs .. 2.434 μs)
std dev              13.38 ns   (9.744 ns .. 18.96 ns)

benchmarking encode/ (Vector SmallSumManual)
time                 2.410 μs   (2.409 μs .. 2.412 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.410 μs   (2.409 μs .. 2.412 μs)
std dev              5.450 ns   (4.473 ns .. 6.727 ns)

benchmarking encode/ (Vector ((Int,Int),(Int,Int)))
time                 1.953 μs   (1.952 μs .. 1.953 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.954 μs   (1.953 μs .. 1.955 μs)
std dev              2.612 ns   (1.963 ns .. 3.702 ns)

benchmarking encode/ (Vector SomeData)
time                 1.459 μs   (1.455 μs .. 1.465 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.457 μs   (1.456 μs .. 1.461 μs)
std dev              7.744 ns   (4.185 ns .. 13.24 ns)

benchmarking decode/ (Vector Int)
time                 683.0 ns   (682.0 ns .. 684.4 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 682.5 ns   (682.0 ns .. 683.7 ns)
std dev              2.567 ns   (1.215 ns .. 4.543 ns)

benchmarking decode/1kb storable (Vector Int32)
time                 86.05 ns   (85.92 ns .. 86.14 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 85.75 ns   (85.61 ns .. 85.89 ns)
std dev              495.2 ps   (427.0 ps .. 616.5 ps)

benchmarking decode/10kb storable (Vector Int32)
time                 520.1 ns   (519.9 ns .. 520.3 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 520.2 ns   (520.1 ns .. 520.5 ns)
std dev              688.4 ps   (529.6 ps .. 955.2 ps)

benchmarking decode/1kb normal (Vector Int32)
time                 1.707 μs   (1.706 μs .. 1.708 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.707 μs   (1.705 μs .. 1.709 μs)
std dev              5.461 ns   (3.831 ns .. 7.529 ns)

benchmarking decode/10kb normal (Vector Int32)
time                 18.00 μs   (17.93 μs .. 18.08 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 18.07 μs   (18.01 μs .. 18.15 μs)
std dev              246.7 ns   (194.0 ns .. 313.3 ns)

benchmarking decode/ (Vector SmallProduct)
time                 2.918 μs   (2.913 μs .. 2.927 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.962 μs   (2.946 μs .. 2.986 μs)
std dev              62.47 ns   (54.03 ns .. 71.29 ns)
variance introduced by outliers: 23% (moderately inflated)

benchmarking decode/ (Vector SmallProductManual)
time                 1.799 μs   (1.798 μs .. 1.800 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.799 μs   (1.798 μs .. 1.800 μs)
std dev              2.748 ns   (2.272 ns .. 3.453 ns)

benchmarking decode/ (Vector SmallSum)
time                 1.923 μs   (1.921 μs .. 1.925 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.921 μs   (1.921 μs .. 1.922 μs)
std dev              3.272 ns   (2.586 ns .. 3.965 ns)

benchmarking decode/ (Vector SmallSumManual)
time                 1.210 μs   (1.209 μs .. 1.210 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.209 μs   (1.208 μs .. 1.210 μs)
std dev              4.152 ns   (3.330 ns .. 5.234 ns)

benchmarking decode/ (Vector ((Int,Int),(Int,Int)))
time                 1.700 μs   (1.699 μs .. 1.701 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.704 μs   (1.700 μs .. 1.710 μs)
std dev              16.68 ns   (13.32 ns .. 19.45 ns)

benchmarking decode/ (Vector SomeData)
time                 2.464 μs   (2.463 μs .. 2.465 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.464 μs   (2.463 μs .. 2.465 μs)
std dev              3.415 ns   (2.502 ns .. 5.377 ns)

It doesn't appear that different to my (admittedly untrained) eye.

Fixes #91.

@mgsloan
Owner

mgsloan commented Feb 14, 2017

It does seem to make the decode/ (Vector SomeData) benchmark about 3x slower, ~2.46 microseconds vs ~0.85 microseconds.

It does make sense to avoid excessive compilation time / memory, though, so merging this. Thanks so much for the thorough investigation and fix!

@mgsloan mgsloan merged commit 2636e87 into mgsloan:master Feb 14, 2017
@sjakobi
Contributor

sjakobi commented Feb 14, 2017

decode/ (Vector SmallProduct) got about 2x slower too, while decode/ (Vector SmallSum) took only a ~6% hit. These are exactly those benchmarks that use the generic peek implementation.
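The slowdown factors can be checked directly from the criterion `time` estimates quoted in the PR description. The snippet below is only a small arithmetic sketch; the names `timings` and `slowdown` are made up for illustration, and the numbers are copied from the before and after runs, converted to nanoseconds.

```haskell
module Main (main) where

import Text.Printf (printf)

-- (benchmark, time before in ns, time after in ns), copied from the
-- criterion output in the PR description.
timings :: [(String, Double, Double)]
timings =
  [ ("decode/ (Vector SomeData)",     850.8, 2464)
  , ("decode/ (Vector SmallProduct)", 1496,  2918)
  , ("decode/ (Vector SmallSum)",     1813,  1923)
  ]

-- Ratio of the time after the change to the time before it.
slowdown :: Double -> Double -> Double
slowdown before after = after / before

main :: IO ()
main = mapM_ report timings
  where
    report :: (String, Double, Double) -> IO ()
    report (name, before, after) =
      printf "%-32s %.2fx\n" name (slowdown before after)
```

This works out to roughly 2.9x for SomeData, 1.95x for SmallProduct, and 1.06x for SmallSum, in line with the figures mentioned in the comments.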

I wonder if it would make sense to put the INLINE pragma back on peek but hide it behind a fastGenericPeek flag (which we can use for official stack releases)?
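One way such a flag could be wired up is with CPP, sketched below. Everything here is an assumption for illustration: the `Decode` class is a toy stand-in rather than store's `Store` class, and `FAST_GENERIC_PEEK` is a hypothetical macro name that a cabal flag could define through `cpp-options: -DFAST_GENERIC_PEEK`.

```haskell
{-# LANGUAGE CPP #-}
{-# LANGUAGE DefaultSignatures #-}
module Main (main) where

import Text.Read (readMaybe)

-- Hypothetical toy class with a default method, used only to show
-- where a conditionally compiled pragma would go.
class Decode a where
  decode :: [Int] -> Maybe a
  default decode :: Read a => [Int] -> Maybe a
  decode = readMaybe . map toEnum
#ifdef FAST_GENERIC_PEEK
  -- Only present when the (hypothetical) macro is defined, for example
  -- by a cabal flag that adds -DFAST_GENERIC_PEEK to cpp-options.
  {-# INLINE decode #-}
#endif

newtype Count = Count Int deriving (Show, Read, Eq)

-- Relies on the default implementation of decode.
instance Decode Count

main :: IO ()
main = print (decode (map fromEnum "Count 7") :: Maybe Count)
```

With the macro undefined the pragma disappears entirely, so default builds keep the lower compile-time memory use, while builds that opt in trade compile-time memory for the faster generic path.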

@mgsloan
Owner

mgsloan commented Feb 16, 2017

It may well make sense to have such a flag. For now, releasing as v0.3.1. It'd be interesting to benchmark how this change affects stack's use of store!

Successfully merging this pull request may close these issues.

High memory usage during compilation using GHC 8.0.2-rc2