Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ideas for better bytestring builder #194

Open
fumieval opened this issue Dec 18, 2019 · 17 comments
Open

Ideas for better bytestring builder #194

fumieval opened this issue Dec 18, 2019 · 17 comments

Comments

@fumieval
Copy link
Contributor

fumieval commented Dec 18, 2019

As https://github.com/haskell-perf/strict-bytestring-builders pointed out, current Data.ByteString.Builder tends to be slow especially when primitives are small.

Some alternatives have good benchmark results (cf https://www.reddit.com/r/haskell/comments/e6yg76/mason_faster_bytestring_builder/ ,https://github.com/fumieval/mason#performance, http://hackage.haskell.org/package/fast-builder); we may be able to take some tricks from those.

It also makes sense to expand the API so that

  • Can build strict ByteString directly
  • return the length after hPutBuilder

I think it's important that the internal API are exposed. At the moment it's difficult to send the contents of Builder over a socket due to BuildSignal being opaque.

@fumieval fumieval changed the title Better bytestring builder? Ideas for better bytestring builder Dec 18, 2019
@hvr
Copy link
Member

hvr commented Dec 19, 2019

@fumieval I've seen your work and I've been meaning to reach out to you about it; it does look promising but I'd like to better understand its trade-offs; is it really strictly superior in all cases, or can you point out pathological cases where the current builder would be significantly (which may not necessarily be blockers) at a disadvantage?

Is it possible to switch to your implementation while being 100% API compatible with the old impl? (i.e. same type-sigs; no observable semantic differences )

@fumieval
Copy link
Contributor Author

bytestring's Builder tends to be slow when working with small primitives (e.g. Word8). From my small benchmark:

bytestring                               mean 247.6 ns  ( +- 12.06 ns  )
bytestring-builder                       mean 188.7 ns  ( +- 34.65 ns  )
bytestring-builder/toStrict              mean 166.8 ns  ( +- 3.603 ns  )
fast-builder/lazy                        mean 67.43 ns  ( +- 20.60 ns  )
fast-builder/strict                      mean 41.29 ns  ( +- 2.150 ns  )
mason/lazy                               mean 378.2 ns  ( +- 5.022 ns  )
mason/strict                             mean 30.44 ns  ( +- 916.5 ps  )

AFAIK both fast-builder and mason have 100% compatible API. However, fast-builder has a [caveat] (http://hackage.haskell.org/package/fast-builder-0.1.2.0/docs/Data-ByteString-FastBuilder.html#v:toLazyByteString) about the performance of toLazyByteString and mason's toLazyByteString, implemented in a similar way, is still poor in some cases.

@hvr
Copy link
Member

hvr commented Jan 12, 2020

Hrm; about the performance penalty of toLazyByteString; do you know why it regresses? which cases specifically suffer most? (i.e. are those pathological, or are those cases actually relevant in real-world cases?)

@fumieval
Copy link
Contributor Author

Because toLazyByteString forks another thread to run the builder action, and the caller thread retrieves the chunks it yields via MVar. fast-builder avoids this problem by delaying forking until it reaches 32K bytes.

@hvr
Copy link
Member

hvr commented Jan 12, 2020

@fumieval so are you saying that the difference/penalty is mostly noticeable for small-ish serializations (i.e. in the <= 32KiB range)?

@fumieval
Copy link
Contributor Author

Yes, currently it has a big (constant?) overhead. If the whole process takes more than 100μs, I think mason gets faster than others (as shown in https://github.com/fumieval/mason#performance) in most cases.

@hvr
Copy link
Member

hvr commented Jan 13, 2020

@fumieval what if the caller could provide a size-hint of the expected output buffer size (i.e. an integer argument)? would it then be possible to mitigate the constant-overhead and get on par with the current builders performance?

@fumieval
Copy link
Contributor Author

I think the problem really is the overhead of thread creation, not about the buffer size. fast-builder implements a rather complicated trick to avoid this issue: http://hackage.haskell.org/package/fast-builder-0.1.2.0/docs/src/Data.ByteString.FastBuilder.Internal.html#toLazyByteStringWith

@yaitskov
Copy link

yaitskov commented May 20, 2020

I wasn't able to replace Builder with Mason.
fumieval/mason#2

Looks like fast builder doesn't provide options to wrap lazy string into builder.

@fumieval
Copy link
Contributor Author

A new approach based on delimited continuations seems promising: ghc-proposals/ghc-proposals#313 (comment)

@chessai
Copy link
Member

chessai commented Jun 25, 2020

@fumieval haven't looked at a lot of your work (which is rather broad btw, thank you), but have you considered something along the lines of builder or small-bytearray-builder. These could be re-worked to use Ptr rather easily, and I'm interested to see how they stack up.

Additionally, @andrewthad has done a lot of thinking about builders and may have some interesting input here.

@andrewthad
Copy link
Contributor

In bytebuild, I use GC-managed ByteArray# for everything instead of Addr#. Also, I don't use CPS. In bytes-builder-shootout, which uses backpack for perf benchmarking, I have observed that for encoding (recursive) tree-like structures, bytebuild outperforms bytestring, but fast-builder outperforms them both.

@fumieval
Copy link
Contributor Author

@chessai small-bytearray-builder's design looks very interesting, but both builder and small-bytearray-builder produce fully strict byte arrays only, while we need streaming behaviour for toLazyByteString and hPutBuilder.

@andrewthad I didn't know about bytes-builder-shootout, that looks useful! If mason works as intended (primitives are well fused and inlined), producing strict ByteString should be faster than fast-builder. I'll try adding mason to bytes-builder-shootout

@fumieval
Copy link
Contributor Author

I added mason to bytes-builder-shootout. the latest release of mason wasn't very good in this benchmark:

treeToHex-2000/small-bytearray-builder   mean 43.27 μs  ( +- 167.3 ns  )
treeToHex-2000/fast-builder              mean 34.22 μs  ( +- 95.85 ns  )
treeToHex-2000/bytestring                mean 60.76 μs  ( +- 3.829 μs  )
treeToHex-2000/mason                     mean 60.34 μs  ( +- 182.0 ns  )
treeToHex-9000/small-bytearray-builder   mean 190.1 μs  ( +- 761.5 ns  )
treeToHex-9000/bytestring                mean 289.9 μs  ( +- 1.700 μs  )
treeToHex-9000/mason                     mean 267.9 μs  ( +- 1.023 μs  )
short-text-tree-1000/small-bytearray-builder mean 27.30 μs  ( +- 142.7 ns  )
short-text-tree-1000/fast-builder        mean 37.36 μs  ( +- 639.8 ns  )
short-text-tree-1000/bytestring          mean 37.17 μs  ( +- 1.763 μs  )
short-text-tree-1000/mason               mean 29.42 μs  ( +- 422.5 ns  )
byte-tree-2000/small-bytearray-builder   mean 30.64 μs  ( +- 131.9 ns  )
byte-tree-2000/fast-builder              mean 28.82 μs  ( +- 114.9 ns  )
byte-tree-2000/bytestring                mean 53.86 μs  ( +- 314.2 ns  )
byte-tree-2000/mason                     mean 40.45 μs  ( +- 1.252 μs  )

I investigated the code and it turns out that the lack of worker-wrapper transformation (which isn't necessary in fast-builder and small-bytearray-builder) is slowing it down. Flattening the internal structure drastically improved the speed.

treeToHex-2000/mason                     mean 37.96 μs  ( +- 306.3 ns  )
treeToHex-9000/mason                     mean 167.6 μs  ( +- 719.6 ns  )
short-text-tree-1000/mason               mean 27.48 μs  ( +- 151.2 ns  )
byte-tree-2000/mason                     mean 32.12 μs  ( +- 170.2 ns  )

@Bodigrim
Copy link
Contributor

@fumieval is there anything actionable here?

@fumieval
Copy link
Contributor Author

Unfortunately, mason's performance regressed a lot since GHC 9.2, and I didn't manage to figure out why. I decided to put the project on hold until delimited continuation primops arrives.

@BebeSparkelSparkel
Copy link
Contributor

@fumieval Delimited Continuations has merged https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7942

Any new progress?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants