Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up Data.ByteString.Short.unpack #526

Merged
merged 5 commits into from
Sep 26, 2022

Conversation

hasufell
Copy link
Member

@hasufell hasufell commented Jul 3, 2022

Following the discussion from #524 (comment)

Before:

$ cabal run -w ghc-9.2.3 --enable-profiling bytestring-bench --ghc-options='-fcheck-prim-bounds -fno-ignore-asserts' -- -p 'unpack' +RTS -s
Up to date
All
  ShortByteString
    ShortByteString unpack
      unpack and look at first 100 elements: OK (0.87s)
        55.6 ms ± 4.8 ms,  66 MB allocated, 242 B  copied, 181 MB peak memory
      unpackLast:                          OK (0.99s)
        63.1 ms ± 5.2 ms,  66 MB allocated, 226 B  copied, 181 MB peak memory

All 2 tests passed (1.96s)
   2,246,950,896 bytes allocated in the heap
           7,568 bytes copied during GC
      71,372,544 bytes maximum residency (29 sample(s))
       1,339,648 bytes maximum slop
             181 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        43 colls,     0 par    1.132s   1.136s     0.0264s    0.0535s
  Gen  1        29 colls,     0 par    0.319s   0.320s     0.0110s    0.0535s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.551s  (  0.554s elapsed)
  GC      time    1.451s  (  1.457s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    2.003s  (  2.011s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,075,230,737 bytes per MUT second

  Productivity  27.5% of total user, 27.5% of total elapsed

After

All
  ShortByteString
    ShortByteString unpack
      unpack and look at first 100 elements: OK (0.23s)
        1.38 μs ± 103 ns, 6.9 KB allocated,   0 B  copied,  47 MB peak memory
      unpackLast:                          OK (0.15s)
        22.6 ms ± 2.2 ms, 132 MB allocated, 436 B  copied,  47 MB peak memory

All 2 tests passed (0.41s)
   2,033,165,224 bytes allocated in the heap
           7,984 bytes copied during GC
       2,342,464 bytes maximum residency (21 sample(s))
         257,800 bytes maximum slop
              47 MiB total memory in use (1 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        55 colls,     0 par    0.003s   0.003s     0.0000s    0.0022s
  Gen  1        21 colls,     0 par    0.025s   0.025s     0.0012s    0.0028s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.425s  (  0.428s elapsed)
  GC      time    0.028s  (  0.028s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.454s  (  0.456s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    4,780,369,992 bytes per MUT second

  Productivity  93.8% of total user, 93.8% of total elapsed

I'm not sure what are the benefits of the old implementation. This one seems to be better in every regard and probably fuses better.

@Bodigrim
Copy link
Contributor

Bodigrim commented Jul 3, 2022

Git history reveals that Data.ByteString.Short.Internal.unpackAppend{Chars,Bytes}{Lazy,Strict} were added in 9e2814f, copied from Data.ByteString.Internal.unpackAppend{Chars,Bytes}{Lazy,Strict}.

-- Unpacking bytestrings into lists efficiently is a tradeoff: on the one hand
-- we would like to write a tight loop that just blasts the list into memory, on
-- the other hand we want it to be unpacked lazily so we don't end up with a
-- massive list data structure in memory.

unpackAppendBytesStrict :: ByteString -> [Word8] -> [Word8]
unpackAppendBytesStrict (BS fp len) xs =
accursedUnutterablePerformIO $ unsafeWithForeignPtr fp $ \base ->
loop (base `plusPtr` (-1)) (base `plusPtr` (-1+len)) xs
where
loop !sentinal !p acc
| p == sentinal = return acc
| otherwise = do x <- peek p
loop sentinal (p `plusPtr` (-1)) (x:acc)

This is a reasonable strategy for ByteStrings, because indexing them involves accursedUnutterablePerformIO $ unsafeWithForeignPtr fp, which is a non-negligible cost if performed per each element. The same does not hold for ShortByteString, because indexWord8Array is as cheap as it gets. Thanks for noticing this, @hasufell!

@hasufell
Copy link
Member Author

hasufell commented Jul 3, 2022

My laptop might not be too reliable about performance tests, but I ran it twice and it seems that the strict foldr got consistently slower: https://gist.github.com/hasufell/33c0736e4fcdd1be4896c08322f191f1

@hasufell
Copy link
Member Author

hasufell commented Jul 3, 2022

Core:

-- RHS size: {terms: 36, types: 12, coercions: 0, joins: 0/2}
$wunpack :: ByteArray# -> [Word8]
$wunpack
  = \ (ww :: ByteArray#) ->
      let {
        y :: Int#
        y = -# (sizeofByteArray# ww) 1# } in
      case ># 0# y of {
        __DEFAULT ->
          letrec {
            go3 :: Int# -> [Word8]
            go3
              = \ (x :: Int#) ->
                  : (case indexWord8Array# ww x of wild { __DEFAULT -> W8# wild })
                    (case ==# x y of {
                       __DEFAULT -> go3 (+# x 1#);
                       1# -> []
                     }); } in
          go3 0#;
        1# -> []
      }

-- RHS size: {terms: 6, types: 3, coercions: 0, joins: 0/0}
unpack :: ShortByteString -> [Word8]
unpack
  = \ (w :: ShortByteString) -> case w of { SBS ww1 -> $wunpack ww1 }

-- RHS size: {terms: 30, types: 25, coercions: 0, joins: 1/1}
$wfoldr' :: forall {a}. (Word8 -> a -> a) -> a -> ByteArray# -> a
$wfoldr'
  = \ (@a) (w :: Word8 -> a -> a) (w1 :: a) (ww :: ByteArray#) ->
      joinrec {
        go1 :: [Word8] -> (a -> a) -> a -> a
        go1 (ds :: [Word8]) (eta :: a -> a) (eta1 :: a)
          = case ds of {
              [] -> eta eta1;
              : y ys ->
                jump go1
                  ys (\ (z :: a) -> case w y z of vx { __DEFAULT -> eta vx }) eta1
            }; } in
      jump go1 ($wunpack ww) id w1

-- RHS size: {terms: 11, types: 9, coercions: 0, joins: 0/0}
foldr' :: forall a. (Word8 -> a -> a) -> a -> ShortByteString -> a
foldr'
  = \ (@a)
      (w :: Word8 -> a -> a)
      (w1 :: a)
      (w2 :: ShortByteString) ->
      case w2 of { SBS ww1 -> $wfoldr' w w1 ww1 }

@sjakobi
Copy link
Member

sjakobi commented Jul 4, 2022

Could you change foldr' so it operates on the ByteArray# directly instead of consuming it via unpack?

I don't find it very surprising that a foldr' that consumes a list isn't efficient.

@hasufell
Copy link
Member Author

hasufell commented Jul 4, 2022

Could you change foldr' so it operates on the ByteArray# directly instead of consuming it via unpack?

I don't find it very surprising that a foldr' that consumes a list isn't efficient.

But why is it faster with the previous version?

@sjakobi
Copy link
Member

sjakobi commented Jul 4, 2022

Comparing the Core between the old and new versions, both for foldr' itself and the foldr' benchmark might reveal the issue.

@hasufell
Copy link
Member Author

hasufell commented Jul 4, 2022

@hasufell
Copy link
Member Author

hasufell commented Jul 4, 2022

Even regular foldr is slower.

I was able to speed them up by writing explicit versions:

foldr :: (Word8 -> a -> a) -> a -> ShortByteString -> a
foldr k v = \sbs ->
  let l  = length sbs
      ba = asBA sbs
      w  = indexWord8Array ba
      go !n | n >= l    = v
            | otherwise = k (w n) (go (n + 1))
  in go 0

-- TODO: could write tail recursive right to left version
foldr' :: (Word8 -> a -> a) -> a -> ShortByteString -> a
foldr' k v = \sbs ->
  let l  = length sbs
      ba = asBA sbs
      w  = indexWord8Array ba
      go !n | n >= l    = v
            | otherwise = let !acc = go (n + 1)
                          in k (w n) acc
  in go 0

but doesn't that indicate that there's a problem with list fusion?

@hasufell
Copy link
Member Author

hasufell commented Jul 4, 2022

Ok, so lazy foldr can also be sped up by marking both unpack and unpackBytes with INLINE. That suggests that -fexpose-all-unfoldings is not really enough and that unpack doesn't fuse at all (which seems pretty bad).

That only leaves strict foldr'. No combination of inlining had improved this, so I guess we can go with the manual version here.

That leaves that question whether we have other functions that don't fuse/inline properly as well.

@hasufell
Copy link
Member Author

hasufell commented Jul 4, 2022

So, here are the benchmark results of the latest patchset, compared with the baseline: https://gist.github.com/hasufell/3710e86fcdfcd5f2c641e7d0a57e30ce

  1. folds (strict and lazy), filter, unfoldrN are faster
  2. raw unpack when just forcing the entire result is slower (kinda expected, since we have more thunks now?)
  3. unpack fuses much better now though (e.g. last . unpack or take 120 . unpack are much faster)
  4. added a lot of new INLINE pragmas (they had an effect)
  5. rest of the functions are mostly the same speed

@clyring
Copy link
Member

clyring commented Jul 14, 2022

[The existing implementation] is a reasonable strategy for ByteStrings, because indexing them involves accursedUnutterablePerformIO $ unsafeWithForeignPtr fp, which is a non-negligible cost if performed per each element. The same does not hold for ShortByteString, because indexWord8Array is as cheap as it gets. Thanks for noticing this, @hasufell!

The need to use accursedUnutterablePerformIO and unsafeWithForeignPtr does not impose any runtime cost on unsafeIndex. (unsafeWithForeignPtr might have a small cost in the current unpackBytes implementation if prevents the loop entry from being optimized to a tail call. But I haven't checked this.)

The performance characteristics of unpack for ByteString and ShortByteString should be very similar. What's good for one is very likely good for the other. The current implementation's goal is to maximize throughput when a large ByteString is unpacked and no list-fusion is possible. It tries to do this by allocating 99% fewer thunks than a fully lazy implementation would, while still allocating the result in small chunks so that a traversal can benefit from the machine's L1 cache.

If I'm reading the benchmark results @hasufell shared in the previous comment correctly, the proposed implementation takes about 75% longer in this case, which is a rather large regression. One alternative worth considering is adapting the existing fusion mechanism for Data.ByteString.unpack to the other unpack variants in bytestring.

Also, the benchmark results @hasufell has reported for 'unpack and look at a small number of leading elements' here and in #524 seem unbelievably large. Reading 100 bytes, allocating a hundred cons cells, and inspecting a few of them should take on the order of a microsecond, not on the order of ten milliseconds. (And it could be reasonably argued that even starting with 100 bytes is too much.) That's four orders of magnitude! I haven't tried reproducing or investigating yet, but something is very wrong somewhere.

@hasufell
Copy link
Member Author

hasufell commented Jul 15, 2022

@clyring when I use this implementation:

-- | /O(n)/. Convert a 'ShortByteString' into a list.
unpack :: ShortByteString -> [Word8]
unpack sbs = build (unpackFoldr sbs)
{-# INLINE unpack #-}

--
-- Have unpack fuse with good list consumers
--
unpackFoldr :: ShortByteString -> (Word8 -> a -> a) -> a -> a
unpackFoldr sbs k z = foldr k z sbs
{-# INLINE [0] unpackFoldr #-}

{-# RULES
"ShortByteString unpack-list" [1]  forall bs .
    unpackFoldr bs (:) [] = unpackBytes bs
 #-}

with the existing 100 chunk unpackBytes implementation, I get these benchmark results compared to bytestring master:

$ cabal run  bytestring-bench  -- -p 'All.ShortByteString.ShortByteString conversions' --baseline out_master.csv
Up to date
All
  ShortByteString
    ShortByteString conversions
      unpack
        1:                               OK (0.24s)
          18.6 ns ± 1.3 ns, 15% slower than baseline
        2:                               OK (0.27s)
          25.0 ns ± 1.4 ns, 17% slower than baseline
        4:                               OK (0.24s)
          41.6 ns ± 2.8 ns, 23% slower than baseline
        8:                               OK (0.38s)
          73.8 ns ± 4.8 ns, 24% slower than baseline
        16:                              OK (0.20s)
          138  ns ±  11 ns, 18% slower than baseline
        32:                              OK (0.19s)
          267  ns ±  26 ns, 14% slower than baseline
        64:                              OK (0.34s)
          551  ns ±  31 ns, 12% slower than baseline
        128:                             OK (0.31s)
          1.00 μs ±  77 ns
        256:                             OK (0.28s)
          1.92 μs ± 104 ns
        512:                             OK (0.29s)
          3.75 μs ± 232 ns
        1024:                            OK (0.29s)
          7.64 μs ± 365 ns
        2048:                            OK (0.16s)
          15.5 μs ± 1.4 μs
        4096:                            OK (0.15s)
          31.6 μs ± 2.6 μs
        8192:                            OK (0.15s)
          66.7 μs ± 5.4 μs
        16384:                           OK (0.17s)
          142  μs ±  14 μs
        32768:                           OK (0.17s)
          311  μs ±  24 μs
        65536:                           OK (0.20s)
          691  μs ±  67 μs
      unpack and get last element:       OK (0.36s)
        2.58 ms ± 121 μs, 87% faster than baseline
      unpack and get first 120 elements: OK (0.27s)
        1.74 μs ±  95 ns, 99% faster than baseline

however... "unpack and get last element" from this PR is still magnitudes faster than the version above with unpackFoldr:

All
  ShortByteString
    ShortByteString conversions
      unpack
        1:                               OK (0.27s)
          21.9 ns ± 1.5 ns, 30% slower than baseline
        2:                               OK (0.34s)
          34.3 ns ± 3.0 ns, 48% slower than baseline
        4:                               OK (0.18s)
          63.3 ns ± 5.6 ns, 68% slower than baseline
        8:                               OK (0.30s)
          125  ns ± 7.0 ns, 94% slower than baseline
        16:                              OK (0.16s)
          250  ns ±  21 ns, 98% slower than baseline
        32:                              OK (0.16s)
          466  ns ±  45 ns, 82% slower than baseline
        64:                              OK (0.16s)
          896  ns ±  84 ns, 66% slower than baseline
        128:                             OK (0.26s)
          1.78 μs ±  90 ns, 81% slower than baseline
        256:                             OK (0.27s)
          3.51 μs ± 271 ns, 82% slower than baseline
        512:                             OK (0.27s)
          7.10 μs ± 328 ns, 88% slower than baseline
        1024:                            OK (0.51s)
          14.1 μs ± 446 ns, 84% slower than baseline
        2048:                            OK (0.15s)
          28.7 μs ± 2.7 μs, 86% slower than baseline
        4096:                            OK (0.25s)
          57.5 μs ± 2.8 μs, 81% slower than baseline
        8192:                            OK (0.15s)
          119  μs ±  11 μs, 78% slower than baseline
        16384:                           OK (0.26s)
          245  μs ±  14 μs, 73% slower than baseline
        32768:                           OK (0.16s)
          541  μs ±  48 μs, 77% slower than baseline
        65536:                           OK (0.16s)
          1.18 ms ± 112 μs, 78% slower than baseline
      unpack and get last element:       OK (0.20s)
        673  μs ±  42 μs, 74% faster than baseline
      unpack and get first 120 elements: OK (0.16s)
        1.88 μs ± 184 ns

@clyring
Copy link
Member

clyring commented Jul 19, 2022

Also, the benchmark results @hasufell has reported for 'unpack and look at a small number of leading elements' here and in #524 seem unbelievably large. Reading 100 bytes, allocating a hundred cons cells, and inspecting a few of them should take on the order of a microsecond, not on the order of ten milliseconds. (And it could be reasonably argued that even starting with 100 bytes is too much.) That's four orders of magnitude! I haven't tried reproducing or investigating yet, but something is very wrong somewhere.

This really terrible performance is caused by the bang on acc in Short.unpackAppendBytesStrict.go, which seems to have been mistakenly added by eaa31e0. This forces the entire result list to be built strictly from the end, rather than lazily from the start in chunks of 100 as was intended.

however... "unpack and get last element" from this PR is still magnitudes faster than the version above with unpackFoldr:

I looked into this; it stems a quirk of the current implementation of enumFromTo @Int that just happens to interact favorably with last. The base case of [x..y] corresponds to [y], so the last seen value from the previous iteration can be discarded in every branch of the fused loop. And this last seen value would otherwise be allocated as a (boxed) Word8 because it it is initially lastError.

In more typical situations, there is no advantage to peeling off the last iteration of the fused loop in this way. Indeed, it typically results mainly in the loop body being duplicated once. So I will raise a GHC issue and try to remove this quirky behavior upstream. EDIT: On second thought, this duplication seems unavoidable for the specific case of [minBound..maxBound], which is a bit unfortunate. Either way, I don't think it's worth worrying about whether this specific interaction happens in a bytestring benchmark.

@hasufell
Copy link
Member Author

hasufell commented Jul 19, 2022

Alright... so I changed the implementation to be similar to Data.ByteString, utilizing the GHC.Exts.build (unpackFoldr sbs) trick. Additionally I removed the bang pattern on the acc of unpackAppendBytesStrict.

The results look pretty good... only the first few tests seem to have regressed, but everything else is apparently faster:

Bench result
All
  ShortByteString
    Small payload
      mempty:                            OK (0.46s)
        22.2 ns ± 1.4 ns, 178% slower than baseline
      UTF-8 String (naive):              OK (0.37s)
        778  ns ±  41 ns
      String (naive):                    OK (0.37s)
        777  ns ±  49 ns, 78% slower than baseline
    intercalate
      intercalate (large):               OK (0.48s)
        20.2 μs ± 1.1 μs, 137% slower than baseline
      intercalate (small):               OK (0.29s)
        1.75 μs ± 170 ns, 93% slower than baseline
      intercalate (tiny):                OK (0.31s)
        214  ns ±  21 ns, 96% slower than baseline
    partition
      strict
        mostlyTrueFast:                  OK (0.23s)
          248  μs ±  22 μs, 80% slower than baseline
        mostlyFalseFast:                 OK (0.31s)
          200  μs ±  11 μs, 67% slower than baseline
        balancedFast:                    OK (0.29s)
          382  μs ±  22 μs, 81% slower than baseline
        mostlyTrueSlow:                  OK (0.26s)
          6.57 ms ± 471 μs, 69% slower than baseline
        mostlyFalseSlow:                 OK (0.25s)
          6.19 ms ± 475 μs, 59% slower than baseline
        balancedSlow:                    OK (0.26s)
          6.44 ms ± 387 μs
    folds
      strict
        foldl
          1:                             OK (0.35s)
            29.1 ns ± 2.8 ns, 55% faster than baseline
          2:                             OK (0.34s)
            31.4 ns ± 2.6 ns, 61% faster than baseline
          4:                             OK (0.36s)
            37.3 ns ± 2.6 ns, 73% faster than baseline
          8:                             OK (0.40s)
            44.8 ns ± 2.6 ns, 81% faster than baseline
          16:                            OK (0.34s)
            66.3 ns ± 5.2 ns, 86% faster than baseline
          32:                            OK (0.37s)
            149  ns ±  11 ns, 84% faster than baseline
          64:                            OK (0.32s)
            235  ns ±  21 ns, 87% faster than baseline
          128:                           OK (0.41s)
            399  ns ±  21 ns, 89% faster than baseline
          256:                           OK (0.39s)
            750  ns ±  41 ns, 89% faster than baseline
          512:                           OK (0.37s)
            1.45 μs ±  84 ns, 90% faster than baseline
          1024:                          OK (0.36s)
            2.84 μs ± 171 ns, 90% faster than baseline
          2048:                          OK (0.32s)
            5.63 μs ± 333 ns, 90% faster than baseline
          4096:                          OK (0.34s)
            11.2 μs ± 668 ns, 90% faster than baseline
          8192:                          OK (0.29s)
            22.3 μs ± 1.4 μs, 91% faster than baseline
          16384:                         OK (0.31s)
            44.5 μs ± 2.7 μs, 91% faster than baseline
          32768:                         OK (0.30s)
            89.1 μs ± 5.7 μs, 91% faster than baseline
          65536:                         OK (0.29s)
            177  μs ±  11 μs, 92% faster than baseline
        foldl'
          1:                             OK (0.36s)
            30.3 ns ± 2.7 ns, 53% faster than baseline
          2:                             OK (0.35s)
            32.7 ns ± 2.9 ns, 60% faster than baseline
          4:                             OK (0.37s)
            37.7 ns ± 2.8 ns, 72% faster than baseline
          8:                             OK (0.41s)
            47.8 ns ± 2.8 ns, 79% faster than baseline
          16:                            OK (0.35s)
            71.2 ns ± 5.5 ns, 85% faster than baseline
          32:                            OK (0.34s)
            119  ns ±  11 ns, 87% faster than baseline
          64:                            OK (0.44s)
            214  ns ±  11 ns, 88% faster than baseline
          128:                           OK (0.31s)
            435  ns ±  42 ns, 88% faster than baseline
          256:                           OK (0.41s)
            817  ns ±  41 ns, 88% faster than baseline
          512:                           OK (0.39s)
            1.58 μs ±  83 ns, 89% faster than baseline
          1024:                          OK (0.38s)
            3.12 μs ± 169 ns, 89% faster than baseline
          2048:                          OK (0.34s)
            6.16 μs ± 345 ns, 89% faster than baseline
          4096:                          OK (0.99s)
            12.3 μs ± 166 ns, 89% faster than baseline
          8192:                          OK (0.31s)
            24.5 μs ± 1.4 μs, 90% faster than baseline
          16384:                         OK (0.33s)
            49.1 μs ± 2.6 μs, 90% faster than baseline
          32768:                         OK (0.32s)
            98.1 μs ± 5.6 μs, 90% faster than baseline
          65536:                         OK (0.31s)
            196  μs ±  11 μs, 91% faster than baseline
        foldr
          1:                             OK (0.37s)
            33.0 ns ± 2.7 ns, 43% faster than baseline
          2:                             OK (0.40s)
            38.8 ns ± 2.6 ns, 50% faster than baseline
          4:                             OK (0.45s)
            50.6 ns ± 2.8 ns, 63% faster than baseline
          8:                             OK (0.38s)
            75.6 ns ± 5.5 ns, 68% faster than baseline
          16:                            OK (0.35s)
            126  ns ±  11 ns, 74% faster than baseline
          32:                            OK (0.34s)
            264  ns ±  25 ns, 76% faster than baseline
          64:                            OK (0.32s)
            492  ns ±  46 ns, 76% faster than baseline
          128:                           OK (0.30s)
            913  ns ±  90 ns, 77% faster than baseline
          256:                           OK (0.38s)
            1.71 μs ±  83 ns, 78% faster than baseline
          512:                           OK (0.37s)
            3.33 μs ± 169 ns, 79% faster than baseline
          1024:                          OK (0.38s)
            6.57 μs ± 339 ns, 79% faster than baseline
          2048:                          OK (0.60s)
            53.9 μs ± 4.9 μs, 48% faster than baseline
          4096:                          OK (0.62s)
            115  μs ±  11 μs, 45% faster than baseline
          8192:                          OK (1.09s)
            221  μs ±  20 μs, 49% faster than baseline
          16384:                         OK (1.09s)
            443  μs ±  44 μs, 50% faster than baseline
          32768:                         OK (1.98s)
            862  μs ±  41 μs, 53% faster than baseline
          65536:                         OK (1.09s)
            1.82 ms ± 175 μs, 52% faster than baseline
        foldr'
          1:                             OK (0.47s)
            25.8 ns ± 1.4 ns, 66% faster than baseline
          2:                             OK (0.48s)
            27.5 ns ± 1.3 ns, 75% faster than baseline
          4:                             OK (0.37s)
            31.9 ns ± 2.9 ns, 83% faster than baseline
          8:                             OK (0.43s)
            44.4 ns ± 2.9 ns, 89% faster than baseline
          16:                            OK (0.37s)
            69.5 ns ± 5.3 ns, 91% faster than baseline
          32:                            OK (0.34s)
            116  ns ±  11 ns, 92% faster than baseline
          64:                            OK (0.43s)
            206  ns ±  10 ns, 93% faster than baseline
          128:                           OK (0.43s)
            415  ns ±  27 ns, 92% faster than baseline
          256:                           OK (0.40s)
            780  ns ±  41 ns, 93% faster than baseline
          512:                           OK (0.38s)
            1.51 μs ±  86 ns, 93% faster than baseline
          1024:                          OK (0.35s)
            2.97 μs ± 180 ns, 93% faster than baseline
          2048:                          OK (0.36s)
            5.90 μs ± 333 ns, 93% faster than baseline
          4096:                          OK (0.35s)
            11.7 μs ± 679 ns, 94% faster than baseline
          8192:                          OK (0.29s)
            23.4 μs ± 1.4 μs, 94% faster than baseline
          16384:                         OK (0.32s)
            46.8 μs ± 2.6 μs, 94% faster than baseline
          32768:                         OK (0.31s)
            93.6 μs ± 5.4 μs, 94% faster than baseline
          65536:                         OK (0.30s)
            187  μs ±  11 μs, 95% faster than baseline
        foldr1'
          1:                             OK (0.36s)
            60.8 ns ± 5.4 ns, 35% faster than baseline
          2:                             OK (0.44s)
            100  ns ± 8.0 ns, 25% faster than baseline
          4:                             OK (0.40s)
            172  ns ±  11 ns, 24% faster than baseline
          8:                             OK (0.35s)
            318  ns ±  23 ns, 31% faster than baseline
          16:                            OK (0.37s)
            668  ns ±  41 ns, 25% faster than baseline
          32:                            OK (0.33s)
            1.23 μs ±  84 ns, 29% faster than baseline
          64:                            OK (0.34s)
            2.43 μs ± 180 ns, 30% faster than baseline
          128:                           OK (0.32s)
            4.79 μs ± 376 ns, 31% faster than baseline
          256:                           OK (0.31s)
            9.60 μs ± 726 ns, 31% faster than baseline
          512:                           OK (0.30s)
            19.1 μs ± 1.5 μs, 31% faster than baseline
          1024:                          OK (0.29s)
            38.1 μs ± 3.1 μs, 32% faster than baseline
          2048:                          OK (0.28s)
            76.1 μs ± 5.8 μs, 32% faster than baseline
          4096:                          OK (0.27s)
            153  μs ±  11 μs, 33% faster than baseline
          8192:                          OK (0.26s)
            304  μs ±  24 μs, 35% faster than baseline
          16384:                         OK (0.25s)
            604  μs ±  51 μs, 36% faster than baseline
          32768:                         OK (0.24s)
            1.20 ms ±  88 μs, 41% faster than baseline
          65536:                         OK (0.22s)
            2.40 ms ± 184 μs, 46% faster than baseline
        unfoldrN
          1:                             OK (0.40s)
            76.4 ns ± 5.6 ns, 40% faster than baseline
          2:                             OK (0.40s)
            78.8 ns ± 5.7 ns, 53% faster than baseline
          4:                             OK (0.41s)
            83.3 ns ± 5.5 ns, 65% faster than baseline
          8:                             OK (0.43s)
            93.4 ns ± 5.7 ns, 80% faster than baseline
          16:                            OK (0.34s)
            114  ns ±  11 ns, 85% faster than baseline
          32:                            OK (0.39s)
            154  ns ±  11 ns, 89% faster than baseline
          64:                            OK (0.35s)
            259  ns ±  21 ns, 90% faster than baseline
          128:                           OK (0.31s)
            422  ns ±  41 ns, 92% faster than baseline
          256:                           OK (0.38s)
            747  ns ±  41 ns, 93% faster than baseline
          512:                           OK (0.35s)
            1.40 μs ±  82 ns, 93% faster than baseline
          1024:                          OK (0.36s)
            2.71 μs ± 169 ns, 93% faster than baseline
          2048:                          OK (0.35s)
            5.29 μs ± 328 ns, 93% faster than baseline
          4096:                          OK (3.41s)
            11.2 μs ± 885 ns, 93% faster than baseline
          8192:                          OK (3.23s)
            21.9 μs ± 953 ns, 93% faster than baseline
          16384:                         OK (1.70s)
            44.3 μs ± 3.7 μs, 93% faster than baseline
          32768:                         OK (1.67s)
            87.8 μs ± 7.2 μs, 93% faster than baseline
          65536:                         OK (1.66s)
            174  μs ±  13 μs, 93% faster than baseline
        filter
          1:                             OK (0.41s)
            79.6 ns ± 5.4 ns, 27% faster than baseline
          2:                             OK (0.41s)
            83.2 ns ± 5.1 ns, 43% faster than baseline
          4:                             OK (0.43s)
            89.0 ns ± 5.2 ns, 57% faster than baseline
          8:                             OK (0.45s)
            97.7 ns ± 5.9 ns, 71% faster than baseline
          16:                            OK (0.56s)
            148  ns ± 8.3 ns, 77% faster than baseline
          32:                            OK (0.38s)
            148  ns ±  11 ns, 87% faster than baseline
          64:                            OK (0.33s)
            259  ns ±  22 ns, 88% faster than baseline
          128:                           OK (0.43s)
            395  ns ±  21 ns, 91% faster than baseline
          256:                           OK (0.36s)
            690  ns ±  42 ns, 92% faster than baseline
          512:                           OK (0.34s)
            1.27 μs ±  89 ns, 92% faster than baseline
          1024:                          OK (0.35s)
            2.46 μs ± 171 ns, 92% faster than baseline
          2048:                          OK (0.33s)
            4.82 μs ± 328 ns, 92% faster than baseline
          4096:                          OK (1.67s)
            10.5 μs ± 961 ns, 92% faster than baseline
          8192:                          OK (1.69s)
            21.0 μs ± 1.6 μs, 92% faster than baseline
          16384:                         OK (0.53s)
            43.5 μs ± 2.1 μs, 92% faster than baseline
          32768:                         OK (0.86s)
            82.7 μs ± 4.1 μs, 92% faster than baseline
          65536:                         OK (1.46s)
            154  μs ± 2.7 μs, 93% faster than baseline
    findIndexOrLength
      takeWhile:                         OK (0.36s)
        25.4 μs ± 1.4 μs, 59% slower than baseline
      dropWhile:                         OK (0.28s)
        16.0 μs ± 1.3 μs
      break:                             OK (0.36s)
        25.5 μs ± 1.3 μs
    findIndex_
      findIndices:                       OK (0.26s)
        30.3 μs ± 2.6 μs, 28% faster than baseline
      find:                              OK (0.33s)
        468  ns ±  43 ns, 91% faster than baseline
    traversals
      map (+1) large:                    OK (0.29s)
        3.44 ms ± 179 μs
      map (+1) small:                    OK (0.42s)
        754  ns ±  42 ns
    ShortByteString strict first index
      FindIndices:                       OK (0.34s)
        483  ns ±  41 ns
      ElemIndices:                       OK (0.33s)
        482  ns ±  42 ns
      FindIndex:                         OK (0.36s)
        578  ns ±  42 ns, 47% slower than baseline
      ElemIndex:                         OK (0.38s)
        67.0 ns ± 5.1 ns
    ShortByteString strict second index
      FindIndices:                       OK (0.35s)
        1.21 μs ±  86 ns
      ElemIndices:                       OK (0.35s)
        1.21 μs ±  82 ns
      FindIndex:                         OK (1.90s)
        170  μs ±  15 μs
      ElemIndex:                         OK (1.88s)
        168  μs ±  16 μs
    ShortByteString index equality inlining
      FindIndices/inlined:               OK (0.24s)
        2.55 ms ± 179 μs
      FindIndices/non-inlined:           OK (0.30s)
        16.5 ms ± 774 μs
      FindIndex/inlined:                 OK (0.40s)
        387  ns ±  21 ns
      FindIndex/non-inlined:             OK (0.29s)
        3.65 μs ± 340 ns
    ShortByteString conversions
      unpack
        1:                               OK (0.37s)
          62.0 ns ± 5.7 ns
        2:                               OK (0.40s)
          73.6 ns ± 5.5 ns
        4:                               OK (0.43s)
          100  ns ± 5.8 ns, 19% faster than baseline
        8:                               OK (0.40s)
          160  ns ±  12 ns, 25% faster than baseline
        16:                              OK (0.36s)
          307  ns ±  23 ns, 27% faster than baseline
        32:                              OK (0.37s)
          617  ns ±  41 ns, 30% faster than baseline
        64:                              OK (0.38s)
          1.31 μs ±  93 ns, 30% faster than baseline
        128:                             OK (0.36s)
          2.62 μs ± 235 ns, 27% faster than baseline
        256:                             OK (0.35s)
          5.25 μs ± 339 ns, 26% faster than baseline
        512:                             OK (0.32s)
          10.4 μs ± 702 ns, 26% faster than baseline
        1024:                            OK (0.31s)
          20.8 μs ± 1.3 μs, 27% faster than baseline
        2048:                            OK (0.30s)
          41.7 μs ± 2.7 μs, 29% faster than baseline
        4096:                            OK (0.30s)
          84.1 μs ± 5.3 μs, 30% faster than baseline
        8192:                            OK (0.29s)
          171  μs ±  12 μs, 30% faster than baseline
        16384:                           OK (0.28s)
          345  μs ±  24 μs, 33% faster than baseline
        32768:                           OK (0.47s)
          711  μs ±  31 μs, 34% faster than baseline
        65536:                           OK (0.46s)
          1.50 ms ±  75 μs, 34% faster than baseline
      pack
        1:                               OK (0.37s)
          69.2 ns ± 6.4 ns
        2:                               OK (0.41s)
          79.9 ns ± 5.2 ns
        4:                               OK (0.46s)
          103  ns ± 6.0 ns
        8:                               OK (0.40s)
          158  ns ±  11 ns
        16:                              OK (0.37s)
          280  ns ±  21 ns
        32:                              OK (0.35s)
          545  ns ±  42 ns
        64:                              OK (0.35s)
          1.16 μs ±  86 ns
        128:                             OK (0.32s)
          2.32 μs ± 169 ns
        256:                             OK (0.33s)
          4.60 μs ± 330 ns
        512:                             OK (0.29s)
          9.15 μs ± 688 ns
        1024:                            OK (0.32s)
          20.7 μs ± 1.3 μs
        2048:                            OK (0.33s)
          42.5 μs ± 3.5 μs
        4096:                            OK (0.31s)
          86.3 μs ± 6.1 μs
        8192:                            OK (0.32s)
          185  μs ±  11 μs
        16384:                           OK (0.32s)
          379  μs ±  34 μs
        32768:                           OK (0.18s)
          750  μs ±  68 μs
        65536:                           OK (0.28s)
          1.50 ms ± 135 μs
      unpack and get last element:       OK (0.27s)
        6.68 ms ± 400 μs, 93% faster than baseline
      unpack and get first 120 elements: OK (0.36s)
        5.41 μs ± 376 ns, 99% faster than baseline

All 186 tests passed (88.73s)

However... it's still (a little) slower for the last 2 benchmarks than my original unpack that uses

unpack sbs = let ix = length sbs - 1
             in List.map (unsafeIndex sbs) [0..ix]
{-# INLINE unpack #-}

which runs

      unpack and get last element:       OK (0.36s)
        2.58 ms ± 121 μs
      unpack and get first 120 elements: OK (0.27s)
        1.74 μs ±  95 ns

vs the latest implementation:

      unpack and get last element:       OK (0.27s)
        6.68 ms ± 400 μs
      unpack and get first 120 elements: OK (0.36s)
        5.41 μs ± 376 ns

But that's quite an improvement.

Copy link
Member

@clyring clyring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delays, @hasufell. This seems to have an excellent power-to-weight ratio and I'd love to see it merged soon. I have one nit-pick to make, though.

Data/ByteString/Short/Internal.hs Outdated Show resolved Hide resolved
Co-authored-by: Matthew Craven <clyring@gmail.com>
@hasufell
Copy link
Member Author

Drone CI seems broken

{-# OPTIONS_HADDOCK not-home #-}

{-# OPTIONS_GHC -fno-warn-name-shadowing -fexpose-all-unfoldings #-}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably stop using -fexpose-all-unfolding. Originally I thought that this flag would have the same effect as marking every definition as INLINEABLE, but it actually exposes the optimized unfoldings which may not interact with RULEs etc. like the unoptimized unfoldings would. The GHC issues https://gitlab.haskell.org/ghc/ghc/-/issues/22202 and https://gitlab.haskell.org/ghc/ghc/-/issues/22203 are relevant in this context.

This doesn't need to happen in this PR though.

@Bodigrim Bodigrim merged commit 2471604 into haskell:master Sep 26, 2022
@Bodigrim Bodigrim added this to the 0.11.4.0 milestone Sep 26, 2022
Bodigrim pushed a commit to Bodigrim/bytestring that referenced this pull request Nov 26, 2022
* Add benchmarks for pack/unpack (ShortByteString)

* Add benchmarks for lazy folds

* Improve performance of pack/unpack/folds in ShortByteString

* Use `GHC.Exts.build (unpackFoldr sbs)` for unpack and fix laziness

* Don't inline 'break' with zero arguments

Co-authored-by: Matthew Craven <clyring@gmail.com>

Co-authored-by: Matthew Craven <clyring@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants