Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve deserialisation performance #95

Merged
merged 2 commits into from
Jun 14, 2024
Merged

Conversation

alt-romes
Copy link
Contributor

Use the lower-level array-construction primitives to avoid intermediate allocations and perform much better at deserialisation.

On Cabal (which uses tar for the hackage index), we observed:

  • Deserialisation of IntTries go from 1.5s to 200ms, with 10GB of allocations going down to roughly 600MB.
  • StringTable deserialization go from 700ms to 50ms, with 4GB of allocations going down to 80MB.

@alt-romes alt-romes force-pushed the master branch 2 times, most recently from 002d68a to c6967db Compare June 11, 2024 21:49
@Bodigrim
Copy link
Contributor

Please rebase atop of the latest master, I've added more CI jobs.

@Bodigrim
Copy link
Contributor

@alt-romes if you get stuck for a specific reason, I'm happy to take a closer look, but otherwise please ping me when CI becomes green.

tar.cabal Outdated Show resolved Hide resolved
@@ -63,10 +65,16 @@ readInt32BE :: BS.ByteString -> Int -> Int32
readInt32BE bs i = fromIntegral (readWord32BE bs i)
{-# INLINE readInt32BE #-}

readWord32OffPtrBE :: Ptr Word32 -> Int -> IO Word32
readWord32OffPtrBE ptr i = do
case targetByteOrder of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

targetByteOrder was broken before GHC 8.10, so it would be more robust to check #if defined(WORDS_BIGENDIAN).

@Bodigrim
Copy link
Contributor

Looks great modulo one suggestion!

How can I reproduce the performance gains?

@alt-romes
Copy link
Contributor Author

Looks great modulo one suggestion!

How can I reproduce the performance gains?

I suggest building Cabal with a dependency on this patched tar with profiling. Then, build another Cabal with the profiled Cabal with +RTS -pj and load the cabal.prof into speedscope.app

So... that would be (writing on my phone, there may be some mistakes):

git clone https://github.com/haskell/cabal
cd cabal
git worktree add ../cabal2 -b wip/reproducing
echo "packages: ., /path/to/tar" >> cabal.project
echo "package *" >> cabal.project
echo "  profiling-detail: late" >> cabal.project
echo "profiling: true" >> cabal.project
cabal build exe:cabal
MYCABAL=$(cabal list-bin exe:cabal)
cd ../cabal2
$MYCABAL +RTS -pj -RTS build exe:cabal
ls cabal.prof
# now, open this file on speedscope.app

You should see the deseralise function take a sizeable chunk of the profile.

Then do cabal clean in cabal2 and try the same thing with the patched version of tar.

Note: on aarch64 I got an improvement from some 2s to 250ms on deserialize.

Then, I investigated the loop assembly and discovered the next bottleneck was a call to hs_bswap32 in the middle of the loop. I implemented the assembly native code generator for this instruction and put up a PR today. This reduced the time some 3x, to 70ms (with this patched GHV).

So, on x64 you may be able to get even better than 200ms for deserialise right off the bat due to the ncg already producing proper assembly for byte swap in that architecture.

Cheers!

@mpickering
Copy link
Contributor

I have added a benchmark which shows the improvement.

alt-romes and others added 2 commits June 14, 2024 09:14
Use the lower-level array-construction primitives to avoid intermediate
allocations and perform much better at deserialisation.

On Cabal (which uses tar for the hackage index), we observed:
* Deserialisation of IntTries go from 1.5s to 200ms, with 10GB of allocations going down to roughly 600MB.
* StringTable deserialization go from 700ms to 50ms, with 4GB of allocations going down to 80MB.

Unfortunately, the newGenArray primitive was only introduced in array 0.5.6.
Since we can't update the bound to force such a recent version of array,
we implement the beToLe function using unboxed array primitives that
have been long available, rather than newGenArray.
This improves from:
  deserialise index: OK
    10.0 ms ± 246 μs

to:
  deserialise index: OK
    527  μs ±  43 μs

Due to the previous commit
mpickering added a commit to mpickering/cabal that referenced this pull request Jun 14, 2024
verifyFetchedTarball has the effect of deserialising the index tarball
(see call to Sec.withIndex).

verifyFetchedTarball is called individually for each package in the
build plan (see ProjectPlanning.hs). Not once per repo.

The hackage tarball is now 880mb so it takes a non significant amount of
time to deserialise this (much better after haskell/tar#95).

This code path is important as it can add 1s with these 38 calls on to
the initial load of a project and scales linearly with the size of your
build tree.

Reproducer: Simple project with "lens" dependency deserialises the index tarball 38 times.

Solution: Refactor verifyFetchedTarball to run once per repo rather than once per package.

In future it would be much better to refactor this function so that the
items are not immediately grouped and ungrouped but I didn't want to
take that on immediately.

Fixes haskell#10110
mpickering added a commit to mpickering/cabal that referenced this pull request Jun 14, 2024
verifyFetchedTarball has the effect of deserialising the index tarball
(see call to Sec.withIndex).

verifyFetchedTarball is called individually for each package in the
build plan (see ProjectPlanning.hs). Not once per repo.

The hackage tarball is now 880mb so it takes a non significant amount of
time to deserialise this (much better after haskell/tar#95).

This code path is important as it can add 1s with these 38 calls on to
the initial load of a project and scales linearly with the size of your
build tree.

Reproducer: Simple project with "lens" dependency deserialises the index tarball 38 times.

Solution: Refactor verifyFetchedTarball to run once per repo rather than once per package.

In future it would be much better to refactor this function so that the
items are not immediately grouped and ungrouped but I didn't want to
take that on immediately.

Fixes haskell#10110
mpickering added a commit to mpickering/cabal that referenced this pull request Jun 14, 2024
verifyFetchedTarball has the effect of deserialising the index tarball
(see call to Sec.withIndex).

verifyFetchedTarball is called individually for each package in the
build plan (see ProjectPlanning.hs). Not once per repo.

The hackage tarball is now 880mb so it takes a non significant amount of
time to deserialise this (much better after haskell/tar#95).

This code path is important as it can add 1s with these 38 calls on to
the initial load of a project and scales linearly with the size of your
build tree.

Reproducer: Simple project with "lens" dependency deserialises the index tarball 38 times.

Solution: Refactor verifyFetchedTarball to run once per repo rather than once per package.

In future it would be much better to refactor this function so that the
items are not immediately grouped and ungrouped but I didn't want to
take that on immediately.

Fixes haskell#10110
@alt-romes
Copy link
Contributor Author

This is ready @Bodigrim !

@alt-romes alt-romes requested a review from Bodigrim June 14, 2024 11:42
@Bodigrim Bodigrim merged commit 6536e03 into haskell:master Jun 14, 2024
22 checks passed
@Bodigrim
Copy link
Contributor

Benchmarks of aarch64 with GHC 9.6:

Before
  deserialise index: OK
    10.4 ms ± 904 μs, 105 MB allocated,  25 KB copied, 124 MB peak memory
After
  deserialise index: OK
    2.47 ms ± 210 μs, 4.5 MB allocated, 398 B  copied, 124 MB peak memory, 76% less than baseline

This is very impressive, thanks, guys!

@alt-romes
Copy link
Contributor Author

alt-romes commented Jun 14, 2024

Benchmarks of aarch64 with vs without the patch that makes GHC produce assembly for byteswap rather than a function call:

Before

$ cabal run bench -- -p "deserialise" --csv before.x
All
  deserialise index: OK
    2.34 ms ± 213 μs

After

cabal run -w /Users/romes/ghc-dev/ghc/_build/stage1/bin/ghc bench -- -p "deserialise" --baseline before.x
All
  deserialise index: OK
    768  μs ±  61 μs, 67% less than baseline

@Bodigrim
Copy link
Contributor

Released as tar-0.6.3.0; it would be great if the forthcoming cabal-install-3.12 release picks it up.

Mikolaj pushed a commit to mpickering/cabal that referenced this pull request Jun 16, 2024
verifyFetchedTarball has the effect of deserialising the index tarball
(see call to Sec.withIndex).

verifyFetchedTarball is called individually for each package in the
build plan (see ProjectPlanning.hs). Not once per repo.

The hackage tarball is now 880mb so it takes a non significant amount of
time to deserialise this (much better after haskell/tar#95).

This code path is important as it can add 1s with these 38 calls on to
the initial load of a project and scales linearly with the size of your
build tree.

Reproducer: Simple project with "lens" dependency deserialises the index tarball 38 times.

Solution: Refactor verifyFetchedTarball to run once per repo rather than once per package.

In future it would be much better to refactor this function so that the
items are not immediately grouped and ungrouped but I didn't want to
take that on immediately.

Fixes haskell#10110
erikd pushed a commit to erikd/cabal that referenced this pull request Jun 17, 2024
verifyFetchedTarball has the effect of deserialising the index tarball
(see call to Sec.withIndex).

verifyFetchedTarball is called individually for each package in the
build plan (see ProjectPlanning.hs). Not once per repo.

The hackage tarball is now 880mb so it takes a non significant amount of
time to deserialise this (much better after haskell/tar#95).

This code path is important as it can add 1s with these 38 calls on to
the initial load of a project and scales linearly with the size of your
build tree.

Reproducer: Simple project with "lens" dependency deserialises the index tarball 38 times.

Solution: Refactor verifyFetchedTarball to run once per repo rather than once per package.

In future it would be much better to refactor this function so that the
items are not immediately grouped and ungrouped but I didn't want to
take that on immediately.

Fixes haskell#10110
mergify bot pushed a commit to haskell/cabal that referenced this pull request Jun 18, 2024
verifyFetchedTarball has the effect of deserialising the index tarball
(see call to Sec.withIndex).

verifyFetchedTarball is called individually for each package in the
build plan (see ProjectPlanning.hs). Not once per repo.

The hackage tarball is now 880mb so it takes a non significant amount of
time to deserialise this (much better after haskell/tar#95).

This code path is important as it can add 1s with these 38 calls on to
the initial load of a project and scales linearly with the size of your
build tree.

Reproducer: Simple project with "lens" dependency deserialises the index tarball 38 times.

Solution: Refactor verifyFetchedTarball to run once per repo rather than once per package.

In future it would be much better to refactor this function so that the
items are not immediately grouped and ungrouped but I didn't want to
take that on immediately.

Fixes #10110

(cherry picked from commit 7d46115)
mergify bot added a commit to haskell/cabal that referenced this pull request Jun 18, 2024
verifyFetchedTarball has the effect of deserialising the index tarball
(see call to Sec.withIndex).

verifyFetchedTarball is called individually for each package in the
build plan (see ProjectPlanning.hs). Not once per repo.

The hackage tarball is now 880mb so it takes a non significant amount of
time to deserialise this (much better after haskell/tar#95).

This code path is important as it can add 1s with these 38 calls on to
the initial load of a project and scales linearly with the size of your
build tree.

Reproducer: Simple project with "lens" dependency deserialises the index tarball 38 times.

Solution: Refactor verifyFetchedTarball to run once per repo rather than once per package.

In future it would be much better to refactor this function so that the
items are not immediately grouped and ungrouped but I didn't want to
take that on immediately.

Fixes #10110

(cherry picked from commit 7d46115)

Co-authored-by: Matthew Pickering <matthewtpickering@gmail.com>
mmhat pushed a commit to mmhat/cabal that referenced this pull request Jun 24, 2024
verifyFetchedTarball has the effect of deserialising the index tarball
(see call to Sec.withIndex).

verifyFetchedTarball is called individually for each package in the
build plan (see ProjectPlanning.hs). Not once per repo.

The hackage tarball is now 880mb so it takes a non significant amount of
time to deserialise this (much better after haskell/tar#95).

This code path is important as it can add 1s with these 38 calls on to
the initial load of a project and scales linearly with the size of your
build tree.

Reproducer: Simple project with "lens" dependency deserialises the index tarball 38 times.

Solution: Refactor verifyFetchedTarball to run once per repo rather than once per package.

In future it would be much better to refactor this function so that the
items are not immediately grouped and ungrouped but I didn't want to
take that on immediately.

Fixes haskell#10110
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants