Improve deserialisation performance #95

alt-romes · 2024-06-11T16:57:03Z

Use the lower-level array-construction primitives to avoid intermediate allocations and perform much better at deserialisation.

On Cabal (which uses tar for the hackage index), we observed:

* Deserialisation of IntTries goes from 1.5s to 200ms, with allocations going down from 10GB to roughly 600MB.
* StringTable deserialisation goes from 700ms to 50ms, with allocations going down from 4GB to 80MB.
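
To illustrate the general idea of avoiding intermediate allocations, here is a minimal sketch (not the code from this PR) that decodes big-endian 32-bit words from a ByteString into an unboxed array, once through an intermediate list and once by filling a freshly allocated array in place. The names `readWord32BE'`, `decodeViaList` and `decodeDirect` are hypothetical and chosen for this example only.

```haskell
import Data.Array.Base (unsafeWrite)
import Data.Array.ST (newArray_, runSTUArray)
import Data.Array.Unboxed (UArray, listArray)
import Data.Bits (shiftL, (.|.))
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BSU
import Data.Word (Word32)

-- Hypothetical helper: read one big-endian Word32 starting at byte offset i.
-- No bounds check; the caller must ensure i + 3 is a valid index.
readWord32BE' :: BS.ByteString -> Int -> Word32
readWord32BE' bs i =
      (fromIntegral (BSU.unsafeIndex bs i)       `shiftL` 24)
  .|. (fromIntegral (BSU.unsafeIndex bs (i + 1)) `shiftL` 16)
  .|. (fromIntegral (BSU.unsafeIndex bs (i + 2)) `shiftL` 8)
  .|.  fromIntegral (BSU.unsafeIndex bs (i + 3))

-- Baseline: builds a list of all elements first, then copies it into an array.
decodeViaList :: BS.ByteString -> UArray Int Word32
decodeViaList bs =
  listArray (0, n - 1) [readWord32BE' bs (4 * k) | k <- [0 .. n - 1]]
  where
    n = BS.length bs `div` 4

-- Direct: allocate the unboxed array once and write each element in place.
decodeDirect :: BS.ByteString -> UArray Int Word32
decodeDirect bs = runSTUArray $ do
  arr <- newArray_ (0, n - 1)
  let go k
        | k >= n    = pure arr
        | otherwise = unsafeWrite arr k (readWord32BE' bs (4 * k)) >> go (k + 1)
  go 0
  where
    n = BS.length bs `div` 4
```

Both functions should return the same array; the direct version simply skips the list. How much that saves in practice depends on the GHC version and optimisation flags, so the figures above are best checked by profiling.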

Codec/Archive/Tar/Index/Utils.hs

Bodigrim · 2024-06-11T22:55:13Z

Please rebase on top of the latest master, I've added more CI jobs.

Bodigrim · 2024-06-12T18:33:46Z

@alt-romes if you get stuck for a specific reason, I'm happy to take a closer look, but otherwise please ping me when CI becomes green.

tar.cabal

Bodigrim · 2024-06-13T19:36:44Z

Codec/Archive/Tar/Index/Utils.hs

@@ -63,10 +65,16 @@ readInt32BE :: BS.ByteString -> Int -> Int32
 readInt32BE bs i = fromIntegral (readWord32BE bs i)
 {-# INLINE readInt32BE #-}

+readWord32OffPtrBE :: Ptr Word32 -> Int -> IO Word32
+readWord32OffPtrBE ptr i = do
+  case targetByteOrder of


targetByteOrder was broken before GHC 8.10, so it would be more robust to check #if defined(WORDS_BIGENDIAN).
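
For illustration, the suggested preprocessor check could look roughly like the sketch below. It assumes that `WORDS_BIGENDIAN` is made visible by including GHC's `MachDeps.h`, and it is not necessarily the code that was merged. `firstWordBE` is a hypothetical helper added only to exercise the function.

```haskell
{-# LANGUAGE CPP #-}
#include "MachDeps.h"

import Data.Word (Word8, Word32, byteSwap32)
import Foreign.Marshal.Array (withArray)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (peekElemOff)

-- Read the i-th 32-bit word of a buffer that stores big-endian words.
-- On a big-endian host the native read is already correct; on a
-- little-endian host the bytes are swapped after the read.
readWord32OffPtrBE :: Ptr Word32 -> Int -> IO Word32
#if defined(WORDS_BIGENDIAN)
readWord32OffPtrBE ptr i = peekElemOff ptr i
#else
readWord32OffPtrBE ptr i = byteSwap32 <$> peekElemOff ptr i
#endif

-- Hypothetical helper: marshal a list of bytes into a temporary buffer
-- and read its first word. The list must contain at least 4 bytes.
firstWordBE :: [Word8] -> IO Word32
firstWordBE bytes = withArray bytes $ \p -> readWord32OffPtrBE (castPtr p) 0
```

With this, `firstWordBE [0xDE, 0xAD, 0xBE, 0xEF]` should yield `0xDEADBEEF` regardless of host byte order.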

Bodigrim · 2024-06-13T19:38:25Z

Looks great modulo one suggestion!

How can I reproduce the performance gains?

alt-romes · 2024-06-13T20:23:29Z

> Looks great modulo one suggestion!
>
> How can I reproduce the performance gains?

I suggest building Cabal with profiling enabled and a dependency on this patched tar. Then use that profiled cabal to build another Cabal checkout, passing +RTS -pj, and load the resulting cabal.prof into speedscope.app.

So... that would be (writing on my phone, there may be some mistakes):

git clone https://github.com/haskell/cabal
cd cabal
git worktree add ../cabal2 -b wip/reproducing
echo "packages: ., /path/to/tar" >> cabal.project
echo "package *" >> cabal.project
echo "  profiling-detail: late" >> cabal.project
echo "profiling: true" >> cabal.project
cabal build exe:cabal
MYCABAL=$(cabal list-bin exe:cabal)
cd ../cabal2
$MYCABAL +RTS -pj -RTS build exe:cabal
ls cabal.prof
# now, open this file on speedscope.app

You should see the deserialise function take a sizeable chunk of the profile.

Then do cabal clean in cabal2 and try the same thing with the patched version of tar.

Note: on aarch64 I got an improvement from some 2s to 250ms on deserialize.

Then I investigated the assembly of the loop and found that the next bottleneck was a call to hs_bswap32 in the middle of it. I implemented support for this instruction in the native code generator and put up a PR today. This reduced the time roughly 3x, to 70ms (with the patched GHC).

So, on x64 you may be able to do even better than 200ms for deserialise right off the bat, since the NCG already produces proper assembly for byte swap on that architecture.

Cheers!

mpickering · 2024-06-14T08:02:17Z

I have added a benchmark which shows the improvement.

Use the lower-level array-construction primitives to avoid intermediate allocations and perform much better at deserialisation.

On Cabal (which uses tar for the hackage index), we observed:

* Deserialisation of IntTries goes from 1.5s to 200ms, with allocations going down from 10GB to roughly 600MB.
* StringTable deserialisation goes from 700ms to 50ms, with allocations going down from 4GB to 80MB.

Unfortunately, the newGenArray primitive was only introduced in array 0.5.6. Since we can't update the bound to force such a recent version of array, we implement the beToLe function using unboxed array primitives that have long been available, rather than newGenArray.

This improves from:
  deserialise index: OK
    10.0 ms ± 246 μs
to:
  deserialise index: OK
    527 μs ± 43 μs
due to the previous commit.

verifyFetchedTarball has the effect of deserialising the index tarball (see the call to Sec.withIndex).

verifyFetchedTarball is called individually for each package in the build plan (see ProjectPlanning.hs), not once per repo.

The hackage tarball is now 880 MB, so it takes a not insignificant amount of time to deserialise it (much better after haskell/tar#95).

This code path is important: with these 38 calls it can add 1s to the initial load of a project, and the cost scales linearly with the size of your build tree.

Reproducer: a simple project with a "lens" dependency deserialises the index tarball 38 times.

Solution: refactor verifyFetchedTarball to run once per repo rather than once per package.

In future it would be much better to refactor this function so that the items are not immediately grouped and ungrouped, but I didn't want to take that on immediately.

Fixes haskell#10110
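
The "once per repo rather than once per package" idea can be sketched as follows. This is a simplified illustration, not the cabal-install code: `RepoName`, `PkgId`, `groupByRepo` and `verifyPerRepo` are hypothetical stand-ins, and the expensive index load is abstracted as a function argument.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical stand-ins for the real cabal-install types.
type RepoName = String
type PkgId    = String

-- Group packages by the repo they come from. `flip (++)` appends each new
-- package after the ones already collected, so input order is kept per repo.
groupByRepo :: [(RepoName, PkgId)] -> Map.Map RepoName [PkgId]
groupByRepo = Map.fromListWith (flip (++)) . map (\(r, p) -> (r, [p]))

-- Run the expensive per-repo action (for example loading the index) once
-- per repo, then check every package of that repo against the result.
verifyPerRepo
  :: Monad m
  => (RepoName -> m idx)      -- expensive: performed once per repo
  -> (idx -> PkgId -> m a)    -- cheap: performed once per package
  -> [(RepoName, PkgId)]
  -> m [a]
verifyPerRepo loadIndex check pkgs =
  fmap concat . traverse perRepo . Map.toList $ groupByRepo pkgs
  where
    perRepo (repo, ps) = do
      idx <- loadIndex repo
      traverse (check idx) ps
```

Note that the results come back grouped by repo (in key order of the map), not in the original package order; a caller that needs the original order would have to carry an index along.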

alt-romes · 2024-06-14T11:42:36Z

This is ready @Bodigrim !

Bodigrim · 2024-06-14T19:37:37Z

Benchmarks on aarch64 with GHC 9.6:

Before
  deserialise index: OK
    10.4 ms ± 904 μs, 105 MB allocated,  25 KB copied, 124 MB peak memory
After
  deserialise index: OK
    2.47 ms ± 210 μs, 4.5 MB allocated, 398 B  copied, 124 MB peak memory, 76% less than baseline

This is very impressive, thanks, guys!

alt-romes · 2024-06-14T21:03:24Z

Benchmarks on aarch64 without vs. with the patch that makes GHC produce assembly for byte swap rather than a function call:

Before

$ cabal run bench -- -p "deserialise" --csv before.x
All
  deserialise index: OK
    2.34 ms ± 213 μs

After

$ cabal run -w /Users/romes/ghc-dev/ghc/_build/stage1/bin/ghc bench -- -p "deserialise" --baseline before.x
All
  deserialise index: OK
    768  μs ±  61 μs, 67% less than baseline

Bodigrim · 2024-06-15T14:14:41Z

Released as tar-0.6.3.0; it would be great if the forthcoming cabal-install-3.12 release picks it up.

alt-romes force-pushed the master branch 2 times, most recently from 002d68a to c6967db Compare June 11, 2024 21:49

Bodigrim reviewed Jun 11, 2024

Codec/Archive/Tar/Index/Utils.hs (outdated)

Bodigrim reviewed Jun 11, 2024

Codec/Archive/Tar/Index/Utils.hs (outdated)

alt-romes force-pushed the master branch from c6967db to 49c72a4 Compare June 12, 2024 05:51

alt-romes requested review from Bodigrim June 12, 2024 15:28

Bodigrim reviewed Jun 12, 2024

tar.cabal (outdated)

alt-romes force-pushed the master branch 2 times, most recently from 2864520 to 556bca0 Compare June 13, 2024 12:08

mpickering force-pushed the master branch from 556bca0 to c15666e Compare June 13, 2024 16:38

Bodigrim reviewed Jun 13, 2024


mpickering force-pushed the master branch from 713d0f9 to 59776ab Compare June 14, 2024 08:01

alt-romes and others added 2 commits June 14, 2024 09:14

Add benchmark for deserialising TarIndex

f719a72


mpickering force-pushed the master branch from 59776ab to f719a72 Compare June 14, 2024 08:14

mpickering mentioned this pull request Jun 14, 2024

perf: verifyFetchedTarball called once for each package from a remote repository haskell/cabal#10110

Closed

mpickering mentioned this pull request Jun 14, 2024

perf: Group together packages by repo when verifying tarballs haskell/cabal#10112

Merged

7 tasks

alt-romes requested a review from Bodigrim June 14, 2024 11:42

Bodigrim merged commit 6536e03 into haskell:master Jun 14, 2024
22 checks passed

mergify bot mentioned this pull request Jun 18, 2024

perf: Group together packages by repo when verifying tarballs (backport #10112) haskell/cabal#10121

Merged

7 tasks

Bodigrim mentioned this pull request Jun 24, 2024

Refactor happyDoActions haskell/happy#280

Merged
