bgamari
Collaborator

@bgamari bgamari commented Jan 15, 2015

With haskell/bytestring#40 resolved it should finally be possible to move binary to bytestring's Builder. This addresses #37.

The following is a comparison of benchmarks/Builder.hs with three GHC versions. The first column is the mean time of the Data.Binary.Builder, the second is the mean of Data.ByteString.Builder, and the third is the percent change relative to Binary.Builder.
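
For readers checking the third column, here is a minimal sketch of how such a percent change can be computed. The function names are illustrative and are not taken from the benchmark tooling used for this PR.

```haskell
-- Percent change of a new mean time relative to a baseline mean time.
-- Negative values mean the new implementation is faster.
percentChange :: Double -> Double -> Double
percentChange baseline new = (new - baseline) / baseline * 100

-- Round to one decimal place, matching the precision shown in the tables.
roundTo1 :: Double -> Double
roundTo1 x = fromIntegral (round (x * 10) :: Integer) / 10
```

For example, `roundTo1 (percentChange 130.18 50.82)` reproduces the -61.0% shown for `bounds/[Word8]` on GHC 7.6.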

There are still a couple of sticky spots but I think these are made up for by the great improvements made elsewhere. I'll try having a look at these sticky spots soon.

GHC 7.6
  bounds/[Word8]                                         :       130.18 us          50.82 us      -61.0%
  "Host endian/1MB of Word32 in chunks of 16"            :       270.20 us         272.34 us      +0.8%
  "Host endian/1MB of Word8 in chunks of 16"             :      2313.06 us         690.88 us      -70.1%
  "small ByteString"                                     :         0.24 us           0.20 us      -19.1%
  [Word8]                                                :        63.07 us          44.71 us      -29.1%
  "length-prefixed ByteString"                           :         8.88 us           5.36 us      -39.6%
  "Host endian/1MB of Word16 in chunks of 16"            :       344.25 us         409.49 us      +19.0%
  "large ByteString"                                     :         0.24 us           0.20 us      -18.6%
  "Host endian/1MB of Word64 in chunks of 16"            :       141.04 us         154.66 us      +9.7%

GHC 7.8
  bounds/[Word8]                                         :       411.85 us         149.93 us      -63.6%
  "Host endian/1MB of Word32 in chunks of 16"            :      1148.21 us         883.50 us      -23.1%
  "Host endian/1MB of Word8 in chunks of 16"             :     18334.59 us        3744.35 us      -79.6%
  "small ByteString"                                     :         0.32 us           0.21 us      -35.0%
  [Word8]                                                :       191.56 us         118.85 us      -38.0%
  "length-prefixed ByteString"                           :         9.17 us           4.94 us      -46.1%
  "Host endian/1MB of Word16 in chunks of 16"            :      2134.92 us        1657.38 us      -22.4%
  "large ByteString"                                     :         0.32 us           0.21 us      -35.7%
  "Host endian/1MB of Word64 in chunks of 16"            :       680.70 us         453.09 us      -33.4%

GHC 7.10
  bounds/[Word8]                                         :       134.83 us          52.63 us      -61.0%
  "Host endian/1MB of Word32 in chunks of 16"            :       230.38 us         270.01 us      +17.2%
  "Host endian/1MB of Word8 in chunks of 16"             :      2399.44 us         685.21 us      -71.4%
  "small ByteString"                                     :         0.25 us           0.22 us      -12.0%
  [Word8]                                                :        62.84 us          45.92 us      -26.9%
  "length-prefixed ByteString"                           :         9.89 us           5.20 us      -47.4%
  "Host endian/1MB of Word16 in chunks of 16"            :       321.24 us         401.79 us      +25.1%
  "large ByteString"                                     :         0.25 us           0.19 us      -21.4%
  "Host endian/1MB of Word64 in chunks of 16"            :       148.73 us         154.96 us      +4.2%

@bgamari
Collaborator Author

bgamari commented Jan 15, 2015

@tibbe, @dcoutts you may also be interested in this.

@dcoutts
Contributor

dcoutts commented Jan 15, 2015

Good stuff.

@kolmodin I'd like your involvement.

In principle this is the direction we've wanted to go, once bytestring builder was sufficiently widely deployed. Now that the perf issue that @kolmodin found before seems to be resolved, it seems like we should get on with it.

Any API compat issues that we need to think about?

@tibbe
Member

tibbe commented Jan 15, 2015

What's the cause of the regressions in 7.10?

@dcoutts
Contributor

dcoutts commented Jan 15, 2015

As a follow-up we should look at the Data.Binary.Builder and see if we ought to rationalise the names to really just be re-exports of Data.ByteString.Builder.
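
As a rough illustration of what "just re-exports" could look like, here is a partial sketch that defines a few of the existing Data.Binary.Builder names directly in terms of Data.ByteString.Builder. It is an assumption about the shape of such a shim, not code from this PR, and it covers only a handful of functions. A real module would use an export list and would need to check that the semantics (for example, chunking behaviour) match.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as L
import Data.Word (Word8, Word16, Word32)

-- The builder type is simply bytestring's Builder under binary's name.
type Builder = B.Builder

-- binary's name on the left, bytestring's primitive on the right.
singleton :: Word8 -> Builder
singleton = B.word8

putWord16be :: Word16 -> Builder
putWord16be = B.word16BE

putWord32le :: Word32 -> Builder
putWord32le = B.word32LE

fromByteString :: S.ByteString -> Builder
fromByteString = B.byteString

toLazyByteString :: Builder -> L.ByteString
toLazyByteString = B.toLazyByteString
```

With definitions like these, existing callers of the binary names would keep compiling while all the work happens in bytestring.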

@dcoutts
Contributor

dcoutts commented Jan 15, 2015

@tibbe If I was following correctly, it was more that the Data.Binary.Builder got dramatically faster with 7.10 while the bytestring one did not. But with @bgamari's haskell/bytestring#40 fix, the bytestring one gets faster too with 7.10.

@bgamari
Collaborator Author

bgamari commented Jan 15, 2015

There are definitely some questions regarding how we want to merge this. Without haskell/bytestring#40 we regress quite heavily on GHC 7.6 and 7.10. Below is a comparison between Data.Binary.Builder (first column) and bytestring's Builder without the fix (second column):

GHC 7.6
  bounds/[Word8]                                         :       131.26 us         214.18 us      +63.2%
  "Host endian/1MB of Word32 in chunks of 16"            :       226.21 us        1053.66 us      +365.8%
  "Host endian/1MB of Word8 in chunks of 16"             :      2642.61 us        4460.94 us      +68.8%
  "small ByteString"                                     :         0.24 us           0.20 us      -17.8%
  [Word8]                                                :        63.09 us         172.24 us      +173.0%
  "length-prefixed ByteString"                           :         9.12 us           5.18 us      -43.2%
  "Host endian/1MB of Word16 in chunks of 16"            :       324.26 us        2000.55 us      +517.0%
  "large ByteString"                                     :         0.25 us           0.19 us      -22.2%
  "Host endian/1MB of Word64 in chunks of 16"            :       134.11 us         545.12 us      +306.5%

GHC 7.8
  bounds/[Word8]                                         :       391.55 us         213.20 us      -45.5%
  "Host endian/1MB of Word32 in chunks of 16"            :      1127.54 us        1057.80 us      -6.2%
  "Host endian/1MB of Word8 in chunks of 16"             :     17799.62 us        4427.87 us      -75.1%
  "small ByteString"                                     :         0.31 us           0.19 us      -36.7%
  [Word8]                                                :       180.97 us         175.08 us      -3.3%
  "length-prefixed ByteString"                           :         8.81 us           4.90 us      -44.4%
  "Host endian/1MB of Word16 in chunks of 16"            :      2551.97 us        2016.27 us      -21.0%
  "large ByteString"                                     :         0.31 us           0.20 us      -36.8%
  "Host endian/1MB of Word64 in chunks of 16"            :       565.32 us         541.24 us      -4.3%

GHC 7.10
  bounds/[Word8]                                         :       182.19 us         212.08 us      +16.4%
  "Host endian/1MB of Word32 in chunks of 16"            :       235.19 us        1057.43 us      +349.6%
  "Host endian/1MB of Word8 in chunks of 16"             :      2300.05 us        4408.89 us      +91.7%
  "small ByteString"                                     :         0.25 us           0.19 us      -23.3%
  [Word8]                                                :        63.88 us         172.94 us      +170.7%
  "length-prefixed ByteString"                           :         8.90 us           5.26 us      -40.9%
  "Host endian/1MB of Word16 in chunks of 16"            :       335.58 us        2001.86 us      +496.5%
  "large ByteString"                                     :         0.24 us           0.19 us      -21.8%
  "Host endian/1MB of Word64 in chunks of 16"            :       139.97 us         550.78 us      +293.5%

GHC 7.8 is quite consistent. With GHC 7.6 and 7.10, however, the small regressions with the fix balloon into several-hundred-percent regressions without. This all suggests that it would likely be good to at least understand these regressions that remain after haskell/bytestring#40 before we move on this.

If it looks like fixing these will require further changes in bytestring, then we might consider setting a restrictive lower bound in binary to ensure that users don't end up with severe performance regressions. Alternatively, we could even hold off on ripping out binary's existing Builder until the bytestring fix has had time to propagate.

@tibbe
Member

tibbe commented Jan 15, 2015

Aside: from your first set of benchmarks this is a curious result:

GHC 7.10
  "Host endian/1MB of Word32 in chunks of 16"            :       230.38 us         270.01 us      +17.2%
  "Host endian/1MB of Word8 in chunks of 16"             :      2399.44 us         685.21 us      -71.4%

These benchmarks look quite similar at first glance, but binary performs wildly differently on them.

@bgamari
Collaborator Author

bgamari commented Jan 15, 2015

@tibbe that is indeed quite interesting. Could this be the result of unaligned accesses?

@tibbe
Member

tibbe commented Jan 15, 2015

It's indeed tricky to define binary's builder in terms of bytestring, since bytestring will only be fast in the very last release. We'd have to require bytestring >= <last release> or people will be unhappy when they run into performance problems.

Perhaps we could keep the old implementation inside some Compat modules inside binary until such a time we don't need to support older bytestring versions anymore. The builder type will need to be kept abstract, so users can't tell which implementation we're using. Not an ideal solution, but I can't think of any better one.

@tibbe
Member

tibbe commented Jan 15, 2015

> @tibbe that is indeed quite interesting. Could this be the result of unaligned accesses?

Alignment shouldn't be an issue for Word8, so I don't think that's it. I'd have to look at the Core.

@bgamari
Collaborator Author

bgamari commented Jan 15, 2015

@tibbe the other way to look at it is that the numbers with a slow bytestring are no worse than the binary builder was when built with 7.8. I haven't heard any screams claiming poor binary performance in the last 12 months so we could arguably just call it a day. The only ones who will experience large regressions are those running on 7.6.

@bgamari
Collaborator Author

bgamari commented Jan 15, 2015

@tibbe it seems that GHC for some reason decides to float all of the address calculations away from the stores (core, assembly). Crazy compiler.

@bgamari
Collaborator Author

bgamari commented Jan 15, 2015

Interestingly enough, this appears to happen during the SpecConstr pass.

Edit: I was wrong, it's actually the simplifier, as one might expect. This arises after the initial float-out pass, which introduces sharing between the sequential integer literals used by the two branches of the buffer-full case analysis. After this the compiler never chooses to push the literals back into their respective branches.

Presumably it feels as though expressions of the form (W8# (narrow8Word# (plusWord# ipv_s1b2m (__word 1)))) are too expensive to duplicate?

@bgamari
Collaborator Author

bgamari commented Jan 15, 2015

@tibbe, I put up some notes here.

@bgamari
Collaborator Author

bgamari commented Jan 15, 2015

Unfortunately fixing this may require GHC changes. It seems that GHC introduces sharing when it floats the __integer literals up to top level. This is all well and good, but then it later fails to float them back in due to extremely restrictive heuristics in the FloatIn pass. In order to float a binding in to a case it seems the binding must both be sufficiently small and not used in all branches.

While our binding is almost certainly small enough, it is used in all (two!) branches of the case.

I'm not really sure what to blame this on. While from a human's perspective floating out something like (fromInteger @ Word8 $fNumWord8 (__integer 1)) is silly, the compiler likely has no idea how expensive fromInteger is at this stage in the compilation. Moreover, while it seems pretty clear that an absurdly light floater like narrow8Word# (plusWord# ww_s1bSC (__word 5)) should be floated in, I can see that these heuristics may be rather fragile.

@kolmodin
Member

This is good work. In the branch I had, I used CPP to rely on the ByteString builder only if it was available, but maybe the BS builder is widely available enough already. That branch though, I think, is on a machine currently packed in boxes. I'm in the process of relocating, which occupies all my time at the moment. I'll be able to look closer in a few days.

This should probably not go into the binary for GHC 7.10 since it was not part of the RC, at least unless we can get a better understanding of the benchmark.

@bgamari
Collaborator Author

bgamari commented Jan 16, 2015

With this patch on GHC c71fb84b8c9ec9c1e279df8c75ceb8a537801aa1 I find that things look much more reasonable in the benchmarks. The first column below is the Binary.Builder and the second is ByteString.Builder. As expected, things are substantially faster with ByteString.

Of course, it's totally unclear that this patch is the fix that we want. At the very least I need to ensure that it doesn't cause unexpected code size blow-ups in other code.

GHC 7.11
  "Host endian/1MB of Word16 in chunks of 16"            :       262.48 us         206.52 us      -21.3%
  "Host endian/1MB of Word32 in chunks of 16"            :       178.51 us         106.08 us      -40.6%
  "Host endian/1MB of Word64 in chunks of 16"            :       108.60 us          63.85 us      -41.2%
  "Host endian/1MB of Word8 in chunks of 16"             :      1847.27 us         372.19 us      -79.9%
  "large ByteString"                                     :         0.16 us           0.15 us      -2.9%
  "length-prefixed ByteString"                           :         7.42 us           4.45 us      -40.0%
  "small ByteString"                                     :         0.15 us           0.15 us      +0.5%
  [Word8]                                                :        49.74 us          39.48 us      -20.6%
  bounds/[Word8]                                         :       109.82 us          47.88 us      -56.4%

@bgamari
Collaborator Author

bgamari commented Jan 22, 2015

The cause of the "Host endian/1MB of Word8 in chunks of 16" benchmark's slowness is being tracked as https://ghc.haskell.org/trac/ghc/ticket/10012

@kolmodin
Member

kolmodin commented Jun 3, 2015

@bgamari Did your fix go into GHC?
Also I'm curious whether you had any tool support to create the benchmark table?

@bgamari
Collaborator Author

bgamari commented Jun 9, 2015

@kolmodin it did not make it upstream nor is it entirely clear that it's the right solution as it is a very big hammer which may have unintended consequences. At this point this work is blocked on my finishing my thesis and finding a free weekend or two.

Indeed, I have a small Python hack which takes the CSV output from Criterion and renders a table.
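
The Python script itself is not shown in this thread. As a rough idea of what such a renderer involves, here is a small Haskell sketch that formats already-parsed means into rows resembling the tables above. The `Row` type, the function names, and the column widths are all assumptions made for illustration. Parsing Criterion's CSV output is left out.

```haskell
import Text.Printf (printf)

-- One benchmark: name, mean with Data.Binary.Builder, mean with
-- Data.ByteString.Builder. Both means are assumed to be in microseconds.
type Row = (String, Double, Double)

-- Render one row: padded name, the two means, and the signed percent change
-- of the second mean relative to the first.
renderRow :: Row -> String
renderRow (name, old, new) =
  printf "  %-55s: %12.2f us   %12.2f us      %+.1f%%" name old new change
  where
    change = (new - old) / old * 100

-- Render a titled table, one benchmark per line.
renderTable :: String -> [Row] -> String
renderTable title rows = unlines (title : map renderRow rows)
```

Calling `putStr (renderTable "GHC 7.10" rows)` with one `Row` per benchmark would print a block in roughly the layout used in this conversation.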

@hvr
Member

hvr commented Dec 20, 2015

bump

What's the current situation of this one? Is this GHC 8.0 material?

@kolmodin
Member

Unfortunately it's not. @bgamari explains in #65 (comment) that it depends on a patch to GHC that has not been merged.

@bgamari
Collaborator Author

bgamari commented Dec 21, 2015

Indeed; this is because we really don't know the right way to fix this yet. However, I'll add this ticket to my todo list. It seems likely that doing a better job optimizing this case could help a substantial amount of code.

@bgamari bgamari force-pushed the bytestring-builder branch from be31bd7 to 93a49d0 Compare March 23, 2016 22:47
@bgamari bgamari closed this Mar 23, 2016