Sliced interface for ShortByteString #541

oberblastmeister · 2022-08-16T16:05:35Z

I think we could take some inspiration from byteslice. This would also address #193, by expanding the api for ByteArray based types instead of changing ByteString to use ByteArray.

The docs for ShortByteString say

It is suitable for use as an internal representation for code that needs to keep many short strings in memory, but it should not be used as an interchange type. That is, it should not generally be used in public APIs.

Having a sliced ShortByteString would allow using a ByteArray based type in public APIs, similar to ByteString.

The text was updated successfully, but these errors were encountered:

Bodigrim · 2022-08-16T21:41:33Z

I'm reluctant to extend bytestring API even further, it's already difficult to maintain in sync and there is no particular reason for SlicedShortByteString to be here, since it is not used by other boot libraries.

I think pushing forward haskell/primitive#354 and #410 is a more sustainable approach: when merged, byteslice will operate over slices of ShortByteString.

oberblastmeister · 2022-08-16T22:53:41Z

In my opinion implementing functions directly on ShortByteString is not optimal. Functions on ShortByteString can be implemented trivially by using slices and should have no overhead because of inlining and simplification. Also, the functions currently copy by default which I think should be more explicit, such as by copying the sliced variant. In theory we could move all the functions from ShortByteString to the sliced variant and have conversion functions for people that want to use ShortByteString, and it would be okay. For example the any function can be easily converted for slices

any :: (Word8 -> Bool) -> Slice -> Bool
any k = \Slice {ba, off, len} ->
  let
      w = indexWord8Array ba
      go !n | n >= len    = False
            | otherwise = k (w n) || go (n + 1)
  in go off

and now any for ShortByteString is easy

any :: (Word8 -> Bool) -> ShortByteString -> Bool
any f = Slice.any f . asSlice

So most of the code is just moved instead of duplicated, and I don't think it would be that much harder to maintain, but of course that is your decision. I think adding this type would help unify vector, bytestring, and text because ByteArray can be cheaply converted between them. This would also encourage people to use ByteArray based types instead of ByteString which always has to be pinned.

Bodigrim · 2022-08-17T19:05:15Z

@oberblastmeister that's understandable goal, but as long as ByteArray from base is shared by all core packages, someone can implement Slice outside of bytestring. I agree that this is likely to cause certain code duplication, but at the moment I do not have bandwidth to include Slice into bytestring itself, at least not without seeing a prior art elsewhere.

@sjakobi what do you think?

oberblastmeister · 2022-08-18T20:11:40Z

I went looking through filepath as that is the primary consumer of ShortByteString. A large amount of functions are ineficient from copying. The most egregious example is splitDirectories. It would be pretty funny if the new ByteArray based functions were slower than the String based ones. On Windows, it is, by $13 \mu s$! The String version is almost 2 times faster than the ByteArray version! On posix, the situation isn't much better, with splitDirectories taking $7 \mu s$. Windows functions seem to suffer much more from copying because of the larger path size. A quick analysis of splitDirectores gives 112 total copies. The benchmark uses a path with 28 segments, splitPath does 3 copies for each segment and dropTrailingPathSeparator does about 1 also. And it is even worse, splitPath copies the accumulator. This makes the effect of copies even larger, as we are basically getting $n^2$ copying behavior. With slices, there would be zero copies. Clearly, we need slices to improve performance.

Bodigrim · 2022-08-18T21:25:55Z

Well, Slice adds 16 byte overhead for each chunk, which in a typical use case of filepath is worse than copying content. Your benchmark with 28 segments looks a bit contrived to me, and FWIW splitDirectories can use (Int, Int, ShortByteString) internally to prevent quadratic behaviour, and copy only resulting chunks, not accumulator.

More generally, filepath can define data Slice = Slice Int Int ShortByteString and use it internally to optimize intermediate operations. I think this is the best balance of cost/benefit right now. If it works well, in future we can discuss this issue further.

clyring · 2022-08-18T21:35:16Z

If the cost of copying some bytearray is significant, it usually makes sense to tell the GC not to copy it, by pinning that bytearray. So I don't find the idea of sliced-unpinned-bytestring all that persuasive on the face of it. (Of course, ByteString is typically still a bit more memory-expensive than simple sliced-pinned-bytearrays would be because its representation involves an extra two-word PlainPtr heap object. But I hope to eventually improve GHC so that PlainPtr does not actually allocate a new heap object.)

But in many cases I expect the observed "copying cost" to actually be dominated by the allocation cost of the buffers in which the copies are placed. And unlike copying in itself this is unaffected by pinnedness. The allocation overhead for String is (with the obvious exceptions) truly horrible: So horrible that for typical or even somewhat-large filepath inputs it doesn't surprise me that $O(n)$ string vs $O(n^2)$ shortbytestring allocations the shortbytestrings can still perform comparably or better.

I've been thinking for a while about removing most of the near-duplication between our ByteString and ShortByteString code, by writing both with identical combinators at different types. I expect that doing so would make it trivial to add an unpinned-slices variant.

oberblastmeister · 2022-08-18T21:56:21Z

Your benchmark with 28 segments looks a bit contrived to me

It's from filepath.

More generally, filepath can define data Slice = Slice Int Int ShortByteString and use it internally to optimize intermediate operations. I think this is the best balance of cost/benefit right now. If it works well, in future we can discuss this issue further.

Sounds good.

I've been thinking for a while about removing most of the near-duplication between our ByteString and ShortByteString code, by writing both with identical combinators at different types. I expect that doing so would make it trivial to add an unpinned-slices variant.

How would you do this? Something like type classes?

clyring · 2022-08-26T02:37:01Z

I've been thinking for a while about removing most of the near-duplication between our ByteString and ShortByteString code, by writing both with identical combinators at different types. I expect that doing so would make it trivial to add an unpinned-slices variant.

How would you do this? Something like type classes?

Type classes can do this, but then un-specialized versions of every function will be generated even though they should never be used. Backpack is perhaps the "correct" tool for the job, but it's not available with the older versions of GHC and Cabal still supported by bytestring, and would also probably require import cycles to use in this way. So I had in mind to use CPP to share the implementations, if/when I get around to this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sliced interface for ShortByteString #541

Sliced interface for ShortByteString #541

oberblastmeister commented Aug 16, 2022

Bodigrim commented Aug 16, 2022

oberblastmeister commented Aug 16, 2022

Bodigrim commented Aug 17, 2022

oberblastmeister commented Aug 18, 2022

Bodigrim commented Aug 18, 2022

clyring commented Aug 18, 2022

oberblastmeister commented Aug 18, 2022

clyring commented Aug 26, 2022

Sliced interface for ShortByteString #541

Sliced interface for ShortByteString #541

Comments

oberblastmeister commented Aug 16, 2022

Bodigrim commented Aug 16, 2022

oberblastmeister commented Aug 16, 2022

Bodigrim commented Aug 17, 2022

oberblastmeister commented Aug 18, 2022

Bodigrim commented Aug 18, 2022

clyring commented Aug 18, 2022

oberblastmeister commented Aug 18, 2022

clyring commented Aug 26, 2022