Speed up Data.ByteString.Short.unpack #526

hasufell · 2022-07-03T19:52:00Z

Following the discussion from #524 (comment)

Before:

$ cabal run -w ghc-9.2.3 --enable-profiling bytestring-bench --ghc-options='-fcheck-prim-bounds -fno-ignore-asserts' -- -p 'unpack' +RTS -s
Up to date
All
  ShortByteString
    ShortByteString unpack
      unpack and look at first 100 elements: OK (0.87s)
        55.6 ms ± 4.8 ms,  66 MB allocated, 242 B  copied, 181 MB peak memory
      unpackLast:                          OK (0.99s)
        63.1 ms ± 5.2 ms,  66 MB allocated, 226 B  copied, 181 MB peak memory

All 2 tests passed (1.96s)
   2,246,950,896 bytes allocated in the heap
           7,568 bytes copied during GC
      71,372,544 bytes maximum residency (29 sample(s))
       1,339,648 bytes maximum slop
             181 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        43 colls,     0 par    1.132s   1.136s     0.0264s    0.0535s
  Gen  1        29 colls,     0 par    0.319s   0.320s     0.0110s    0.0535s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.551s  (  0.554s elapsed)
  GC      time    1.451s  (  1.457s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    2.003s  (  2.011s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,075,230,737 bytes per MUT second

  Productivity  27.5% of total user, 27.5% of total elapsed

After

All
  ShortByteString
    ShortByteString unpack
      unpack and look at first 100 elements: OK (0.23s)
        1.38 μs ± 103 ns, 6.9 KB allocated,   0 B  copied,  47 MB peak memory
      unpackLast:                          OK (0.15s)
        22.6 ms ± 2.2 ms, 132 MB allocated, 436 B  copied,  47 MB peak memory

All 2 tests passed (0.41s)
   2,033,165,224 bytes allocated in the heap
           7,984 bytes copied during GC
       2,342,464 bytes maximum residency (21 sample(s))
         257,800 bytes maximum slop
              47 MiB total memory in use (1 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        55 colls,     0 par    0.003s   0.003s     0.0000s    0.0022s
  Gen  1        21 colls,     0 par    0.025s   0.025s     0.0012s    0.0028s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.425s  (  0.428s elapsed)
  GC      time    0.028s  (  0.028s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.454s  (  0.456s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,780,369,992 bytes per MUT second

  Productivity  93.8% of total user, 93.8% of total elapsed

I'm not sure what are the benefits of the old implementation. This one seems to be better in every regard and probably fuses better.

Data/ByteString/Short/Internal.hs

Bodigrim · 2022-07-03T21:37:49Z

Git history reveals that Data.ByteString.Short.Internal.unpackAppend{Chars,Bytes}{Lazy,Strict} were added in 9e2814f, copied from Data.ByteString.Internal.unpackAppend{Chars,Bytes}{Lazy,Strict}.

bytestring/Data/ByteString/Internal.hs

Lines 454 to 457 in 2b9416d

    
           -- Unpacking bytestrings into lists efficiently is a tradeoff: on the one hand 
        
           -- we would like to write a tight loop that just blasts the list into memory, on 
        
           -- the other hand we want it to be unpacked lazily so we don't end up with a 
        
           -- massive list data structure in memory.

bytestring/Data/ByteString/Internal.hs

Lines 494 to 502 in 2b9416d

    
           unpackAppendBytesStrict :: ByteString -> [Word8] -> [Word8] 
        
           unpackAppendBytesStrict (BS fp len) xs = 
        
               accursedUnutterablePerformIO $ unsafeWithForeignPtr fp $ \base -> 
        
                 loop (base `plusPtr` (-1)) (base `plusPtr` (-1+len)) xs 
        
             where 
        
               loop !sentinal !p acc 
        
                 | p == sentinal = return acc 
        
                 | otherwise     = do x <- peek p 
        
                                      loop sentinal (p `plusPtr` (-1)) (x:acc)

This is a reasonable strategy for ByteStrings, because indexing them involves accursedUnutterablePerformIO $ unsafeWithForeignPtr fp, which is a non-negligible cost if performed per each element. The same does not hold for ShortByteString, because indexWord8Array is as cheap as it gets. Thanks for noticing this, @hasufell!

hasufell · 2022-07-03T22:04:34Z

My laptop might not be too reliable about performance tests, but I ran it twice and it seems that the strict foldr got consistently slower: https://gist.github.com/hasufell/33c0736e4fcdd1be4896c08322f191f1

hasufell · 2022-07-03T23:00:24Z

Core:

-- RHS size: {terms: 36, types: 12, coercions: 0, joins: 0/2}
$wunpack :: ByteArray# -> [Word8]
$wunpack
  = \ (ww :: ByteArray#) ->
      let {
        y :: Int#
        y = -# (sizeofByteArray# ww) 1# } in
      case ># 0# y of {
        __DEFAULT ->
          letrec {
            go3 :: Int# -> [Word8]
            go3
              = \ (x :: Int#) ->
                  : (case indexWord8Array# ww x of wild { __DEFAULT -> W8# wild })
                    (case ==# x y of {
                       __DEFAULT -> go3 (+# x 1#);
                       1# -> []
                     }); } in
          go3 0#;
        1# -> []
      }

-- RHS size: {terms: 6, types: 3, coercions: 0, joins: 0/0}
unpack :: ShortByteString -> [Word8]
unpack
  = \ (w :: ShortByteString) -> case w of { SBS ww1 -> $wunpack ww1 }

-- RHS size: {terms: 30, types: 25, coercions: 0, joins: 1/1}
$wfoldr' :: forall {a}. (Word8 -> a -> a) -> a -> ByteArray# -> a
$wfoldr'
  = \ (@a) (w :: Word8 -> a -> a) (w1 :: a) (ww :: ByteArray#) ->
      joinrec {
        go1 :: [Word8] -> (a -> a) -> a -> a
        go1 (ds :: [Word8]) (eta :: a -> a) (eta1 :: a)
          = case ds of {
              [] -> eta eta1;
              : y ys ->
                jump go1
                  ys (\ (z :: a) -> case w y z of vx { __DEFAULT -> eta vx }) eta1
            }; } in
      jump go1 ($wunpack ww) id w1

-- RHS size: {terms: 11, types: 9, coercions: 0, joins: 0/0}
foldr' :: forall a. (Word8 -> a -> a) -> a -> ShortByteString -> a
foldr'
  = \ (@a)
      (w :: Word8 -> a -> a)
      (w1 :: a)
      (w2 :: ShortByteString) ->
      case w2 of { SBS ww1 -> $wfoldr' w w1 ww1 }

sjakobi · 2022-07-04T12:11:31Z

Could you change foldr' so it operates on the ByteArray# directly instead of consuming it via unpack?

I don't find it very surprising that a foldr' that consumes a list isn't efficient.

hasufell · 2022-07-04T12:17:18Z

Could you change foldr' so it operates on the ByteArray# directly instead of consuming it via unpack?

I don't find it very surprising that a foldr' that consumes a list isn't efficient.

But why is it faster with the previous version?

sjakobi · 2022-07-04T12:28:03Z

Comparing the Core between the old and new versions, both for foldr' itself and the foldr' benchmark might reveal the issue.

hasufell · 2022-07-04T14:18:37Z

Here are the cores: https://gist.github.com/hasufell/653dea9cc5ad0f0d6cd35e47442fe35b

hasufell · 2022-07-04T15:01:23Z

Even regular foldr is slower.

I was able to speed them up by writing explicit versions:

foldr :: (Word8 -> a -> a) -> a -> ShortByteString -> a
foldr k v = \sbs ->
  let l  = length sbs
      ba = asBA sbs
      w  = indexWord8Array ba
      go !n | n >= l    = v
            | otherwise = k (w n) (go (n + 1))
  in go 0

-- TODO: could write tail recursive right to left version
foldr' :: (Word8 -> a -> a) -> a -> ShortByteString -> a
foldr' k v = \sbs ->
  let l  = length sbs
      ba = asBA sbs
      w  = indexWord8Array ba
      go !n | n >= l    = v
            | otherwise = let !acc = go (n + 1)
                          in k (w n) acc
  in go 0

but doesn't that indicate that there's a problem with list fusion?

hasufell · 2022-07-04T15:20:48Z

Ok, so lazy foldr can also be sped up by marking both unpack and unpackBytes with INLINE. That suggests that -fexpose-all-unfoldings is not really enough and that unpack doesn't fuse at all (which seems pretty bad).

That only leaves strict foldr'. No combination of inlining had improved this, so I guess we can go with the manual version here.

That leaves that question whether we have other functions that don't fuse/inline properly as well.

hasufell · 2022-07-04T22:44:52Z

So, here are the benchmark results of the latest patchset, compared with the baseline: https://gist.github.com/hasufell/3710e86fcdfcd5f2c641e7d0a57e30ce

folds (strict and lazy), filter, unfoldrN are faster
raw unpack when just forcing the entire result is slower (kinda expected, since we have more thunks now?)
unpack fuses much better now though (e.g. last . unpack or take 120 . unpack are much faster)
added a lot of new INLINE pragmas (they had an effect)
rest of the functions are mostly the same speed

clyring · 2022-07-14T18:15:21Z

[The existing implementation] is a reasonable strategy for ByteStrings, because indexing them involves accursedUnutterablePerformIO $ unsafeWithForeignPtr fp, which is a non-negligible cost if performed per each element. The same does not hold for ShortByteString, because indexWord8Array is as cheap as it gets. Thanks for noticing this, @hasufell!

The need to use accursedUnutterablePerformIO and unsafeWithForeignPtr does not impose any runtime cost on unsafeIndex. (unsafeWithForeignPtr might have a small cost in the current unpackBytes implementation if prevents the loop entry from being optimized to a tail call. But I haven't checked this.)

The performance characteristics of unpack for ByteString and ShortByteString should be very similar. What's good for one is very likely good for the other. The current implementation's goal is to maximize throughput when a large ByteString is unpacked and no list-fusion is possible. It tries to do this by allocating 99% fewer thunks than a fully lazy implementation would, while still allocating the result in small chunks so that a traversal can benefit from the machine's L1 cache.

If I'm reading the benchmark results @hasufell shared in the previous comment correctly, the proposed implementation takes about 75% longer in this case, which is a rather large regression. One alternative worth considering is adapting the existing fusion mechanism for Data.ByteString.unpack to the other unpack variants in bytestring.

Also, the benchmark results @hasufell has reported for 'unpack and look at a small number of leading elements' here and in #524 seem unbelievably large. Reading 100 bytes, allocating a hundred cons cells, and inspecting a few of them should take on the order of a microsecond, not on the order of ten milliseconds. (And it could be reasonably argued that even starting with 100 bytes is too much.) That's four orders of magnitude! I haven't tried reproducing or investigating yet, but something is very wrong somewhere.

hasufell · 2022-07-15T22:30:57Z

@clyring when I use this implementation:

-- | /O(n)/. Convert a 'ShortByteString' into a list.
unpack :: ShortByteString -> [Word8]
unpack sbs = build (unpackFoldr sbs)
{-# INLINE unpack #-}

--
-- Have unpack fuse with good list consumers
--
unpackFoldr :: ShortByteString -> (Word8 -> a -> a) -> a -> a
unpackFoldr sbs k z = foldr k z sbs
{-# INLINE [0] unpackFoldr #-}

{-# RULES
"ShortByteString unpack-list" [1]  forall bs .
    unpackFoldr bs (:) [] = unpackBytes bs
 #-}

with the existing 100 chunk unpackBytes implementation, I get these benchmark results compared to bytestring master:

$ cabal run  bytestring-bench  -- -p 'All.ShortByteString.ShortByteString conversions' --baseline out_master.csv
Up to date
All
  ShortByteString
    ShortByteString conversions
      unpack
        1:                               OK (0.24s)
          18.6 ns ± 1.3 ns, 15% slower than baseline
        2:                               OK (0.27s)
          25.0 ns ± 1.4 ns, 17% slower than baseline
        4:                               OK (0.24s)
          41.6 ns ± 2.8 ns, 23% slower than baseline
        8:                               OK (0.38s)
          73.8 ns ± 4.8 ns, 24% slower than baseline
        16:                              OK (0.20s)
          138  ns ±  11 ns, 18% slower than baseline
        32:                              OK (0.19s)
          267  ns ±  26 ns, 14% slower than baseline
        64:                              OK (0.34s)
          551  ns ±  31 ns, 12% slower than baseline
        128:                             OK (0.31s)
          1.00 μs ±  77 ns
        256:                             OK (0.28s)
          1.92 μs ± 104 ns
        512:                             OK (0.29s)
          3.75 μs ± 232 ns
        1024:                            OK (0.29s)
          7.64 μs ± 365 ns
        2048:                            OK (0.16s)
          15.5 μs ± 1.4 μs
        4096:                            OK (0.15s)
          31.6 μs ± 2.6 μs
        8192:                            OK (0.15s)
          66.7 μs ± 5.4 μs
        16384:                           OK (0.17s)
          142  μs ±  14 μs
        32768:                           OK (0.17s)
          311  μs ±  24 μs
        65536:                           OK (0.20s)
          691  μs ±  67 μs
      unpack and get last element:       OK (0.36s)
        2.58 ms ± 121 μs, 87% faster than baseline
      unpack and get first 120 elements: OK (0.27s)
        1.74 μs ±  95 ns, 99% faster than baseline

however... "unpack and get last element" from this PR is still magnitudes faster than the version above with unpackFoldr:

All
  ShortByteString
    ShortByteString conversions
      unpack
        1:                               OK (0.27s)
          21.9 ns ± 1.5 ns, 30% slower than baseline
        2:                               OK (0.34s)
          34.3 ns ± 3.0 ns, 48% slower than baseline
        4:                               OK (0.18s)
          63.3 ns ± 5.6 ns, 68% slower than baseline
        8:                               OK (0.30s)
          125  ns ± 7.0 ns, 94% slower than baseline
        16:                              OK (0.16s)
          250  ns ±  21 ns, 98% slower than baseline
        32:                              OK (0.16s)
          466  ns ±  45 ns, 82% slower than baseline
        64:                              OK (0.16s)
          896  ns ±  84 ns, 66% slower than baseline
        128:                             OK (0.26s)
          1.78 μs ±  90 ns, 81% slower than baseline
        256:                             OK (0.27s)
          3.51 μs ± 271 ns, 82% slower than baseline
        512:                             OK (0.27s)
          7.10 μs ± 328 ns, 88% slower than baseline
        1024:                            OK (0.51s)
          14.1 μs ± 446 ns, 84% slower than baseline
        2048:                            OK (0.15s)
          28.7 μs ± 2.7 μs, 86% slower than baseline
        4096:                            OK (0.25s)
          57.5 μs ± 2.8 μs, 81% slower than baseline
        8192:                            OK (0.15s)
          119  μs ±  11 μs, 78% slower than baseline
        16384:                           OK (0.26s)
          245  μs ±  14 μs, 73% slower than baseline
        32768:                           OK (0.16s)
          541  μs ±  48 μs, 77% slower than baseline
        65536:                           OK (0.16s)
          1.18 ms ± 112 μs, 78% slower than baseline
      unpack and get last element:       OK (0.20s)
        673  μs ±  42 μs, 74% faster than baseline
      unpack and get first 120 elements: OK (0.16s)
        1.88 μs ± 184 ns

clyring · 2022-07-19T18:21:10Z

Also, the benchmark results @hasufell has reported for 'unpack and look at a small number of leading elements' here and in #524 seem unbelievably large. Reading 100 bytes, allocating a hundred cons cells, and inspecting a few of them should take on the order of a microsecond, not on the order of ten milliseconds. (And it could be reasonably argued that even starting with 100 bytes is too much.) That's four orders of magnitude! I haven't tried reproducing or investigating yet, but something is very wrong somewhere.

This really terrible performance is caused by the bang on acc in Short.unpackAppendBytesStrict.go, which seems to have been mistakenly added by eaa31e0. This forces the entire result list to be built strictly from the end, rather than lazily from the start in chunks of 100 as was intended.

however... "unpack and get last element" from this PR is still magnitudes faster than the version above with unpackFoldr:

I looked into this; it stems a quirk of the current implementation of enumFromTo @Int that just happens to interact favorably with last. The base case of [x..y] corresponds to [y], so the last seen value from the previous iteration can be discarded in every branch of the fused loop. And this last seen value would otherwise be allocated as a (boxed) Word8 because it it is initially lastError.

In more typical situations, there is no advantage to peeling off the last iteration of the fused loop in this way. Indeed, it typically results mainly in the loop body being duplicated once. ~~So I will raise a GHC issue and try to remove this quirky behavior upstream.~~ EDIT: On second thought, this duplication seems unavoidable for the specific case of [minBound..maxBound], which is a bit unfortunate. Either way, I don't think it's worth worrying about whether this specific interaction happens in a bytestring benchmark.

hasufell · 2022-07-19T21:42:38Z

Alright... so I changed the implementation to be similar to Data.ByteString, utilizing the GHC.Exts.build (unpackFoldr sbs) trick. Additionally I removed the bang pattern on the acc of unpackAppendBytesStrict.

The results look pretty good... only the first few tests seem to have regressed, but everything else is apparently faster:

Bench result

All
  ShortByteString
    Small payload
      mempty:                            OK (0.46s)
        22.2 ns ± 1.4 ns, 178% slower than baseline
      UTF-8 String (naive):              OK (0.37s)
        778  ns ±  41 ns
      String (naive):                    OK (0.37s)
        777  ns ±  49 ns, 78% slower than baseline
    intercalate
      intercalate (large):               OK (0.48s)
        20.2 μs ± 1.1 μs, 137% slower than baseline
      intercalate (small):               OK (0.29s)
        1.75 μs ± 170 ns, 93% slower than baseline
      intercalate (tiny):                OK (0.31s)
        214  ns ±  21 ns, 96% slower than baseline
    partition
      strict
        mostlyTrueFast:                  OK (0.23s)
          248  μs ±  22 μs, 80% slower than baseline
        mostlyFalseFast:                 OK (0.31s)
          200  μs ±  11 μs, 67% slower than baseline
        balancedFast:                    OK (0.29s)
          382  μs ±  22 μs, 81% slower than baseline
        mostlyTrueSlow:                  OK (0.26s)
          6.57 ms ± 471 μs, 69% slower than baseline
        mostlyFalseSlow:                 OK (0.25s)
          6.19 ms ± 475 μs, 59% slower than baseline
        balancedSlow:                    OK (0.26s)
          6.44 ms ± 387 μs
    folds
      strict
        foldl
          1:                             OK (0.35s)
            29.1 ns ± 2.8 ns, 55% faster than baseline
          2:                             OK (0.34s)
            31.4 ns ± 2.6 ns, 61% faster than baseline
          4:                             OK (0.36s)
            37.3 ns ± 2.6 ns, 73% faster than baseline
          8:                             OK (0.40s)
            44.8 ns ± 2.6 ns, 81% faster than baseline
          16:                            OK (0.34s)
            66.3 ns ± 5.2 ns, 86% faster than baseline
          32:                            OK (0.37s)
            149  ns ±  11 ns, 84% faster than baseline
          64:                            OK (0.32s)
            235  ns ±  21 ns, 87% faster than baseline
          128:                           OK (0.41s)
            399  ns ±  21 ns, 89% faster than baseline
          256:                           OK (0.39s)
            750  ns ±  41 ns, 89% faster than baseline
          512:                           OK (0.37s)
            1.45 μs ±  84 ns, 90% faster than baseline
          1024:                          OK (0.36s)
            2.84 μs ± 171 ns, 90% faster than baseline
          2048:                          OK (0.32s)
            5.63 μs ± 333 ns, 90% faster than baseline
          4096:                          OK (0.34s)
            11.2 μs ± 668 ns, 90% faster than baseline
          8192:                          OK (0.29s)
            22.3 μs ± 1.4 μs, 91% faster than baseline
          16384:                         OK (0.31s)
            44.5 μs ± 2.7 μs, 91% faster than baseline
          32768:                         OK (0.30s)
            89.1 μs ± 5.7 μs, 91% faster than baseline
          65536:                         OK (0.29s)
            177  μs ±  11 μs, 92% faster than baseline
        foldl'
          1:                             OK (0.36s)
            30.3 ns ± 2.7 ns, 53% faster than baseline
          2:                             OK (0.35s)
            32.7 ns ± 2.9 ns, 60% faster than baseline
          4:                             OK (0.37s)
            37.7 ns ± 2.8 ns, 72% faster than baseline
          8:                             OK (0.41s)
            47.8 ns ± 2.8 ns, 79% faster than baseline
          16:                            OK (0.35s)
            71.2 ns ± 5.5 ns, 85% faster than baseline
          32:                            OK (0.34s)
            119  ns ±  11 ns, 87% faster than baseline
          64:                            OK (0.44s)
            214  ns ±  11 ns, 88% faster than baseline
          128:                           OK (0.31s)
            435  ns ±  42 ns, 88% faster than baseline
          256:                           OK (0.41s)
            817  ns ±  41 ns, 88% faster than baseline
          512:                           OK (0.39s)
            1.58 μs ±  83 ns, 89% faster than baseline
          1024:                          OK (0.38s)
            3.12 μs ± 169 ns, 89% faster than baseline
          2048:                          OK (0.34s)
            6.16 μs ± 345 ns, 89% faster than baseline
          4096:                          OK (0.99s)
            12.3 μs ± 166 ns, 89% faster than baseline
          8192:                          OK (0.31s)
            24.5 μs ± 1.4 μs, 90% faster than baseline
          16384:                         OK (0.33s)
            49.1 μs ± 2.6 μs, 90% faster than baseline
          32768:                         OK (0.32s)
            98.1 μs ± 5.6 μs, 90% faster than baseline
          65536:                         OK (0.31s)
            196  μs ±  11 μs, 91% faster than baseline
        foldr
          1:                             OK (0.37s)
            33.0 ns ± 2.7 ns, 43% faster than baseline
          2:                             OK (0.40s)
            38.8 ns ± 2.6 ns, 50% faster than baseline
          4:                             OK (0.45s)
            50.6 ns ± 2.8 ns, 63% faster than baseline
          8:                             OK (0.38s)
            75.6 ns ± 5.5 ns, 68% faster than baseline
          16:                            OK (0.35s)
            126  ns ±  11 ns, 74% faster than baseline
          32:                            OK (0.34s)
            264  ns ±  25 ns, 76% faster than baseline
          64:                            OK (0.32s)
            492  ns ±  46 ns, 76% faster than baseline
          128:                           OK (0.30s)
            913  ns ±  90 ns, 77% faster than baseline
          256:                           OK (0.38s)
            1.71 μs ±  83 ns, 78% faster than baseline
          512:                           OK (0.37s)
            3.33 μs ± 169 ns, 79% faster than baseline
          1024:                          OK (0.38s)
            6.57 μs ± 339 ns, 79% faster than baseline
          2048:                          OK (0.60s)
            53.9 μs ± 4.9 μs, 48% faster than baseline
          4096:                          OK (0.62s)
            115  μs ±  11 μs, 45% faster than baseline
          8192:                          OK (1.09s)
            221  μs ±  20 μs, 49% faster than baseline
          16384:                         OK (1.09s)
            443  μs ±  44 μs, 50% faster than baseline
          32768:                         OK (1.98s)
            862  μs ±  41 μs, 53% faster than baseline
          65536:                         OK (1.09s)
            1.82 ms ± 175 μs, 52% faster than baseline
        foldr'
          1:                             OK (0.47s)
            25.8 ns ± 1.4 ns, 66% faster than baseline
          2:                             OK (0.48s)
            27.5 ns ± 1.3 ns, 75% faster than baseline
          4:                             OK (0.37s)
            31.9 ns ± 2.9 ns, 83% faster than baseline
          8:                             OK (0.43s)
            44.4 ns ± 2.9 ns, 89% faster than baseline
          16:                            OK (0.37s)
            69.5 ns ± 5.3 ns, 91% faster than baseline
          32:                            OK (0.34s)
            116  ns ±  11 ns, 92% faster than baseline
          64:                            OK (0.43s)
            206  ns ±  10 ns, 93% faster than baseline
          128:                           OK (0.43s)
            415  ns ±  27 ns, 92% faster than baseline
          256:                           OK (0.40s)
            780  ns ±  41 ns, 93% faster than baseline
          512:                           OK (0.38s)
            1.51 μs ±  86 ns, 93% faster than baseline
          1024:                          OK (0.35s)
            2.97 μs ± 180 ns, 93% faster than baseline
          2048:                          OK (0.36s)
            5.90 μs ± 333 ns, 93% faster than baseline
          4096:                          OK (0.35s)
            11.7 μs ± 679 ns, 94% faster than baseline
          8192:                          OK (0.29s)
            23.4 μs ± 1.4 μs, 94% faster than baseline
          16384:                         OK (0.32s)
            46.8 μs ± 2.6 μs, 94% faster than baseline
          32768:                         OK (0.31s)
            93.6 μs ± 5.4 μs, 94% faster than baseline
          65536:                         OK (0.30s)
            187  μs ±  11 μs, 95% faster than baseline
        foldr1'
          1:                             OK (0.36s)
            60.8 ns ± 5.4 ns, 35% faster than baseline
          2:                             OK (0.44s)
            100  ns ± 8.0 ns, 25% faster than baseline
          4:                             OK (0.40s)
            172  ns ±  11 ns, 24% faster than baseline
          8:                             OK (0.35s)
            318  ns ±  23 ns, 31% faster than baseline
          16:                            OK (0.37s)
            668  ns ±  41 ns, 25% faster than baseline
          32:                            OK (0.33s)
            1.23 μs ±  84 ns, 29% faster than baseline
          64:                            OK (0.34s)
            2.43 μs ± 180 ns, 30% faster than baseline
          128:                           OK (0.32s)
            4.79 μs ± 376 ns, 31% faster than baseline
          256:                           OK (0.31s)
            9.60 μs ± 726 ns, 31% faster than baseline
          512:                           OK (0.30s)
            19.1 μs ± 1.5 μs, 31% faster than baseline
          1024:                          OK (0.29s)
            38.1 μs ± 3.1 μs, 32% faster than baseline
          2048:                          OK (0.28s)
            76.1 μs ± 5.8 μs, 32% faster than baseline
          4096:                          OK (0.27s)
            153  μs ±  11 μs, 33% faster than baseline
          8192:                          OK (0.26s)
            304  μs ±  24 μs, 35% faster than baseline
          16384:                         OK (0.25s)
            604  μs ±  51 μs, 36% faster than baseline
          32768:                         OK (0.24s)
            1.20 ms ±  88 μs, 41% faster than baseline
          65536:                         OK (0.22s)
            2.40 ms ± 184 μs, 46% faster than baseline
        unfoldrN
          1:                             OK (0.40s)
            76.4 ns ± 5.6 ns, 40% faster than baseline
          2:                             OK (0.40s)
            78.8 ns ± 5.7 ns, 53% faster than baseline
          4:                             OK (0.41s)
            83.3 ns ± 5.5 ns, 65% faster than baseline
          8:                             OK (0.43s)
            93.4 ns ± 5.7 ns, 80% faster than baseline
          16:                            OK (0.34s)
            114  ns ±  11 ns, 85% faster than baseline
          32:                            OK (0.39s)
            154  ns ±  11 ns, 89% faster than baseline
          64:                            OK (0.35s)
            259  ns ±  21 ns, 90% faster than baseline
          128:                           OK (0.31s)
            422  ns ±  41 ns, 92% faster than baseline
          256:                           OK (0.38s)
            747  ns ±  41 ns, 93% faster than baseline
          512:                           OK (0.35s)
            1.40 μs ±  82 ns, 93% faster than baseline
          1024:                          OK (0.36s)
            2.71 μs ± 169 ns, 93% faster than baseline
          2048:                          OK (0.35s)
            5.29 μs ± 328 ns, 93% faster than baseline
          4096:                          OK (3.41s)
            11.2 μs ± 885 ns, 93% faster than baseline
          8192:                          OK (3.23s)
            21.9 μs ± 953 ns, 93% faster than baseline
          16384:                         OK (1.70s)
            44.3 μs ± 3.7 μs, 93% faster than baseline
          32768:                         OK (1.67s)
            87.8 μs ± 7.2 μs, 93% faster than baseline
          65536:                         OK (1.66s)
            174  μs ±  13 μs, 93% faster than baseline
        filter
          1:                             OK (0.41s)
            79.6 ns ± 5.4 ns, 27% faster than baseline
          2:                             OK (0.41s)
            83.2 ns ± 5.1 ns, 43% faster than baseline
          4:                             OK (0.43s)
            89.0 ns ± 5.2 ns, 57% faster than baseline
          8:                             OK (0.45s)
            97.7 ns ± 5.9 ns, 71% faster than baseline
          16:                            OK (0.56s)
            148  ns ± 8.3 ns, 77% faster than baseline
          32:                            OK (0.38s)
            148  ns ±  11 ns, 87% faster than baseline
          64:                            OK (0.33s)
            259  ns ±  22 ns, 88% faster than baseline
          128:                           OK (0.43s)
            395  ns ±  21 ns, 91% faster than baseline
          256:                           OK (0.36s)
            690  ns ±  42 ns, 92% faster than baseline
          512:                           OK (0.34s)
            1.27 μs ±  89 ns, 92% faster than baseline
          1024:                          OK (0.35s)
            2.46 μs ± 171 ns, 92% faster than baseline
          2048:                          OK (0.33s)
            4.82 μs ± 328 ns, 92% faster than baseline
          4096:                          OK (1.67s)
            10.5 μs ± 961 ns, 92% faster than baseline
          8192:                          OK (1.69s)
            21.0 μs ± 1.6 μs, 92% faster than baseline
          16384:                         OK (0.53s)
            43.5 μs ± 2.1 μs, 92% faster than baseline
          32768:                         OK (0.86s)
            82.7 μs ± 4.1 μs, 92% faster than baseline
          65536:                         OK (1.46s)
            154  μs ± 2.7 μs, 93% faster than baseline
    findIndexOrLength
      takeWhile:                         OK (0.36s)
        25.4 μs ± 1.4 μs, 59% slower than baseline
      dropWhile:                         OK (0.28s)
        16.0 μs ± 1.3 μs
      break:                             OK (0.36s)
        25.5 μs ± 1.3 μs
    findIndex_
      findIndices:                       OK (0.26s)
        30.3 μs ± 2.6 μs, 28% faster than baseline
      find:                              OK (0.33s)
        468  ns ±  43 ns, 91% faster than baseline
    traversals
      map (+1) large:                    OK (0.29s)
        3.44 ms ± 179 μs
      map (+1) small:                    OK (0.42s)
        754  ns ±  42 ns
    ShortByteString strict first index
      FindIndices:                       OK (0.34s)
        483  ns ±  41 ns
      ElemIndices:                       OK (0.33s)
        482  ns ±  42 ns
      FindIndex:                         OK (0.36s)
        578  ns ±  42 ns, 47% slower than baseline
      ElemIndex:                         OK (0.38s)
        67.0 ns ± 5.1 ns
    ShortByteString strict second index
      FindIndices:                       OK (0.35s)
        1.21 μs ±  86 ns
      ElemIndices:                       OK (0.35s)
        1.21 μs ±  82 ns
      FindIndex:                         OK (1.90s)
        170  μs ±  15 μs
      ElemIndex:                         OK (1.88s)
        168  μs ±  16 μs
    ShortByteString index equality inlining
      FindIndices/inlined:               OK (0.24s)
        2.55 ms ± 179 μs
      FindIndices/non-inlined:           OK (0.30s)
        16.5 ms ± 774 μs
      FindIndex/inlined:                 OK (0.40s)
        387  ns ±  21 ns
      FindIndex/non-inlined:             OK (0.29s)
        3.65 μs ± 340 ns
    ShortByteString conversions
      unpack
        1:                               OK (0.37s)
          62.0 ns ± 5.7 ns
        2:                               OK (0.40s)
          73.6 ns ± 5.5 ns
        4:                               OK (0.43s)
          100  ns ± 5.8 ns, 19% faster than baseline
        8:                               OK (0.40s)
          160  ns ±  12 ns, 25% faster than baseline
        16:                              OK (0.36s)
          307  ns ±  23 ns, 27% faster than baseline
        32:                              OK (0.37s)
          617  ns ±  41 ns, 30% faster than baseline
        64:                              OK (0.38s)
          1.31 μs ±  93 ns, 30% faster than baseline
        128:                             OK (0.36s)
          2.62 μs ± 235 ns, 27% faster than baseline
        256:                             OK (0.35s)
          5.25 μs ± 339 ns, 26% faster than baseline
        512:                             OK (0.32s)
          10.4 μs ± 702 ns, 26% faster than baseline
        1024:                            OK (0.31s)
          20.8 μs ± 1.3 μs, 27% faster than baseline
        2048:                            OK (0.30s)
          41.7 μs ± 2.7 μs, 29% faster than baseline
        4096:                            OK (0.30s)
          84.1 μs ± 5.3 μs, 30% faster than baseline
        8192:                            OK (0.29s)
          171  μs ±  12 μs, 30% faster than baseline
        16384:                           OK (0.28s)
          345  μs ±  24 μs, 33% faster than baseline
        32768:                           OK (0.47s)
          711  μs ±  31 μs, 34% faster than baseline
        65536:                           OK (0.46s)
          1.50 ms ±  75 μs, 34% faster than baseline
      pack
        1:                               OK (0.37s)
          69.2 ns ± 6.4 ns
        2:                               OK (0.41s)
          79.9 ns ± 5.2 ns
        4:                               OK (0.46s)
          103  ns ± 6.0 ns
        8:                               OK (0.40s)
          158  ns ±  11 ns
        16:                              OK (0.37s)
          280  ns ±  21 ns
        32:                              OK (0.35s)
          545  ns ±  42 ns
        64:                              OK (0.35s)
          1.16 μs ±  86 ns
        128:                             OK (0.32s)
          2.32 μs ± 169 ns
        256:                             OK (0.33s)
          4.60 μs ± 330 ns
        512:                             OK (0.29s)
          9.15 μs ± 688 ns
        1024:                            OK (0.32s)
          20.7 μs ± 1.3 μs
        2048:                            OK (0.33s)
          42.5 μs ± 3.5 μs
        4096:                            OK (0.31s)
          86.3 μs ± 6.1 μs
        8192:                            OK (0.32s)
          185  μs ±  11 μs
        16384:                           OK (0.32s)
          379  μs ±  34 μs
        32768:                           OK (0.18s)
          750  μs ±  68 μs
        65536:                           OK (0.28s)
          1.50 ms ± 135 μs
      unpack and get last element:       OK (0.27s)
        6.68 ms ± 400 μs, 93% faster than baseline
      unpack and get first 120 elements: OK (0.36s)
        5.41 μs ± 376 ns, 99% faster than baseline

All 186 tests passed (88.73s)

However... it's still (a little) slower for the last 2 benchmarks than my original unpack that uses

unpack sbs = let ix = length sbs - 1
             in List.map (unsafeIndex sbs) [0..ix]
{-# INLINE unpack #-}

which runs

      unpack and get last element:       OK (0.36s)
        2.58 ms ± 121 μs
      unpack and get first 120 elements: OK (0.27s)
        1.74 μs ±  95 ns

vs the latest implementation:

      unpack and get last element:       OK (0.27s)
        6.68 ms ± 400 μs
      unpack and get first 120 elements: OK (0.36s)
        5.41 μs ± 376 ns

But that's quite an improvement.

clyring

Sorry for the delays, @hasufell. This seems to have an excellent power-to-weight ratio and I'd love to see it merged soon. I have one nit-pick to make, though.

Data/ByteString/Short/Internal.hs

Co-authored-by: Matthew Craven <clyring@gmail.com>

hasufell · 2022-09-26T03:22:58Z

Drone CI seems broken

sjakobi · 2022-09-26T11:31:19Z

Data/ByteString/Short/Internal.hs

 {-# OPTIONS_HADDOCK not-home #-}

+{-# OPTIONS_GHC -fno-warn-name-shadowing -fexpose-all-unfoldings #-}


I think we should probably stop using -fexpose-all-unfolding. Originally I thought that this flag would have the same effect as marking every definition as INLINEABLE, but it actually exposes the optimized unfoldings which may not interact with RULEs etc. like the unoptimized unfoldings would. The GHC issues https://gitlab.haskell.org/ghc/ghc/-/issues/22202 and https://gitlab.haskell.org/ghc/ghc/-/issues/22203 are relevant in this context.

This doesn't need to happen in this PR though.

* Add benchmarks for pack/unpack (ShortByteString) * Add benchmarks for lazy folds * Improve performance of pack/unpack/folds in ShortByteString * Use `GHC.Exts.build (unpackFoldr sbs)` for unpack and fix laziness * Don't inline 'break' with zero arguments Co-authored-by: Matthew Craven <clyring@gmail.com> Co-authored-by: Matthew Craven <clyring@gmail.com>

Bodigrim reviewed Jul 3, 2022

View reviewed changes

Data/ByteString/Short/Internal.hs Show resolved Hide resolved

hasufell force-pushed the improve-unpack branch from 486be40 to 76a4ed9 Compare July 3, 2022 21:39

hasufell added 2 commits July 5, 2022 00:30

Add benchmarks for pack/unpack (ShortByteString)

299bb55

Add benchmarks for lazy folds

abedb66

hasufell force-pushed the improve-unpack branch from 76a4ed9 to 050a3d0 Compare July 4, 2022 22:41

Improve performance of pack/unpack/folds in ShortByteString

ab11a9b

hasufell force-pushed the improve-unpack branch from 050a3d0 to ab11a9b Compare July 4, 2022 22:46

Use GHC.Exts.build (unpackFoldr sbs) for unpack and fix laziness

cbb7a11

clyring reviewed Sep 18, 2022

View reviewed changes

Data/ByteString/Short/Internal.hs Outdated Show resolved Hide resolved

Don't inline 'break' with zero arguments

0687725

Co-authored-by: Matthew Craven <clyring@gmail.com>

clyring approved these changes Sep 22, 2022

View reviewed changes

clyring requested a review from Bodigrim September 22, 2022 15:42

Bodigrim approved these changes Sep 24, 2022

View reviewed changes

Bodigrim requested a review from sjakobi September 24, 2022 02:04

sjakobi approved these changes Sep 26, 2022

View reviewed changes

Bodigrim merged commit 2471604 into haskell:master Sep 26, 2022

Bodigrim added this to the 0.11.4.0 milestone Sep 26, 2022

sjakobi mentioned this pull request Sep 26, 2022

Add Data.ByteString.Short.unconsN #525

Draft

clyring mentioned this pull request Jan 15, 2023

Test that our rewrite rules and list fusion actually work #574

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up Data.ByteString.Short.unpack #526

Speed up Data.ByteString.Short.unpack #526

hasufell commented Jul 3, 2022 •

edited

Loading

Bodigrim commented Jul 3, 2022

hasufell commented Jul 3, 2022 •

edited

Loading

hasufell commented Jul 3, 2022

sjakobi commented Jul 4, 2022

hasufell commented Jul 4, 2022

sjakobi commented Jul 4, 2022

hasufell commented Jul 4, 2022

hasufell commented Jul 4, 2022 •

edited

Loading

hasufell commented Jul 4, 2022

hasufell commented Jul 4, 2022 •

edited

Loading

clyring commented Jul 14, 2022

hasufell commented Jul 15, 2022 •

edited

Loading

clyring commented Jul 19, 2022 •

edited

Loading

hasufell commented Jul 19, 2022 •

edited

Loading

clyring left a comment

hasufell commented Sep 26, 2022

sjakobi Sep 26, 2022

		{-# OPTIONS_HADDOCK not-home #-}

		{-# OPTIONS_GHC -fno-warn-name-shadowing -fexpose-all-unfoldings #-}

Speed up Data.ByteString.Short.unpack #526

Speed up Data.ByteString.Short.unpack #526

Conversation

hasufell commented Jul 3, 2022 • edited Loading

Bodigrim commented Jul 3, 2022

hasufell commented Jul 3, 2022 • edited Loading

hasufell commented Jul 3, 2022

sjakobi commented Jul 4, 2022

hasufell commented Jul 4, 2022

sjakobi commented Jul 4, 2022

hasufell commented Jul 4, 2022

hasufell commented Jul 4, 2022 • edited Loading

hasufell commented Jul 4, 2022

hasufell commented Jul 4, 2022 • edited Loading

clyring commented Jul 14, 2022

hasufell commented Jul 15, 2022 • edited Loading

clyring commented Jul 19, 2022 • edited Loading

hasufell commented Jul 19, 2022 • edited Loading

clyring left a comment

Choose a reason for hiding this comment

hasufell commented Sep 26, 2022

sjakobi Sep 26, 2022

Choose a reason for hiding this comment

hasufell commented Jul 3, 2022 •

edited

Loading

hasufell commented Jul 3, 2022 •

edited

Loading

hasufell commented Jul 4, 2022 •

edited

Loading

hasufell commented Jul 4, 2022 •

edited

Loading

hasufell commented Jul 15, 2022 •

edited

Loading

clyring commented Jul 19, 2022 •

edited

Loading

hasufell commented Jul 19, 2022 •

edited

Loading