Improve deserialisation performance #95
Please rebase atop of the latest
@alt-romes if you get stuck for a specific reason, I'm happy to take a closer look, but otherwise please ping me when CI becomes green.
Codec/Archive/Tar/Index/Utils.hs
@@ -63,10 +65,16 @@ readInt32BE :: BS.ByteString -> Int -> Int32
readInt32BE bs i = fromIntegral (readWord32BE bs i)
{-# INLINE readInt32BE #-}

readWord32OffPtrBE :: Ptr Word32 -> Int -> IO Word32
readWord32OffPtrBE ptr i = do
  case targetByteOrder of
targetByteOrder was broken before GHC 8.10, so it would be more robust to check #if defined(WORDS_BIGENDIAN).
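A minimal sketch of what the suggested CPP check could look like. This is not the code that was merged: it assumes WORDS_BIGENDIAN is made available by including GHC's MachDeps.h, it uses peekElemOff in place of whatever read helper the module actually uses, and demo is a hypothetical helper added only for illustration.

```haskell
{-# LANGUAGE CPP #-}

-- Assumption: MachDeps.h (shipped with GHC) defines WORDS_BIGENDIAN on
-- big-endian targets and leaves it undefined on little-endian ones.
#include "MachDeps.h"

import Data.Word (Word8, Word32, byteSwap32)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Marshal.Array (pokeArray)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (peekElemOff)

-- | Read the i-th 32-bit word behind the pointer, interpreting the stored
-- bytes as big-endian regardless of the host byte order.
readWord32OffPtrBE :: Ptr Word32 -> Int -> IO Word32
readWord32OffPtrBE ptr i = do
#if defined(WORDS_BIGENDIAN)
  -- Host order already matches the on-disk order: read as-is.
  peekElemOff ptr i
#else
  -- Little-endian host: read in host order, then swap the bytes.
  byteSwap32 <$> peekElemOff ptr i
#endif

-- | Hypothetical helper: store the bytes 01 02 03 04 and read them back
-- as one big-endian word.
demo :: IO Word32
demo = alloca $ \p -> do
  pokeArray (castPtr p :: Ptr Word8) [0x01, 0x02, 0x03, 0x04]
  readWord32OffPtrBE p 0
```

Because the branch is chosen by the preprocessor, the result does not depend on the value of targetByteOrder at all, which sidesteps the pre-8.10 problem mentioned above.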
Looks great modulo one suggestion! How can I reproduce the performance gains?
I suggest building Cabal with a dependency on this patched tar with profiling. Then, build another Cabal with the profiled Cabal with So... that would be (writing on my phone, there may be some mistakes):
You should see the Then do Note: on aarch64 I got an improvement from some 2s to 250ms on deserialise. Then I investigated the loop assembly and discovered that the next bottleneck was a call to hs_bswap32 in the middle of the loop. I implemented support for this instruction in the native code generator and put up a PR today. This reduced the time some 3x, to 70ms (with this patched GHC). So, on x64 you may be able to get even better than 200ms for deserialise right off the bat, because the NCG already produces proper assembly for byte swap on that architecture. Cheers!
I have added a benchmark which shows the improvement.
Use the lower-level array-construction primitives to avoid intermediate allocations and perform much better at deserialisation.

On Cabal (which uses tar for the hackage index), we observed:

* Deserialisation of IntTries goes from 1.5s to 200ms, with 10GB of allocations going down to roughly 600MB.
* StringTable deserialisation goes from 700ms to 50ms, with 4GB of allocations going down to 80MB.

Unfortunately, the newGenArray primitive was only introduced in array 0.5.6. Since we can't update the bound to force such a recent version of array, we implement the beToLe function using unboxed array primitives that have long been available, rather than newGenArray.
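To illustrate the general idea of filling an unboxed array directly instead of going through intermediate lists, here is a rough sketch. It is not the PR's beToLe: the name decodeWord32sBE, its signature, and the demo value are hypothetical, and it relies only on runSTUArray, newArray_ and writeArray, which have been in the array package for a long time.

```haskell
import Control.Monad (forM_)
import Data.Array.ST (newArray_, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, elems)
import Data.Bits (shiftL, (.|.))
import qualified Data.ByteString as BS
import qualified Data.ByteString.Unsafe as BSU
import Data.Word (Word32)

-- | Hypothetical helper: decode n big-endian 32-bit words from the start
-- of a ByteString straight into an unboxed array. The array is allocated
-- once and written in place, so no intermediate list is built.
decodeWord32sBE :: Int -> BS.ByteString -> UArray Int Word32
decodeWord32sBE n bs
  | n < 0 || BS.length bs < 4 * n = error "decodeWord32sBE: input too short"
  | otherwise = runSTUArray $ do
      arr <- newArray_ (0, n - 1)
      forM_ [0 .. n - 1] $ \i -> writeArray arr i (wordAt (4 * i))
      pure arr
  where
    -- The bounds check above makes the unchecked indexing safe.
    byteAt :: Int -> Word32
    byteAt j = fromIntegral (BSU.unsafeIndex bs j)

    -- Assemble one word from four bytes, most significant byte first.
    wordAt :: Int -> Word32
    wordAt j =
          (byteAt j       `shiftL` 24)
      .|. (byteAt (j + 1) `shiftL` 16)
      .|. (byteAt (j + 2) `shiftL` 8)
      .|.  byteAt (j + 3)

-- | Hypothetical usage: two words encoded as eight bytes.
demo :: [Word32]
demo = elems (decodeWord32sBE 2 (BS.pack [0, 0, 0, 1, 0x12, 0x34, 0x56, 0x78]))
```

The byte-by-byte assembly here is endianness-independent; reading whole words through a pointer and byte-swapping them, as the diff does, trades that simplicity for fewer memory accesses.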
This improves from:

deserialise index: OK 10.0 ms ± 246 μs

to:

deserialise index: OK 527 μs ± 43 μs

due to the previous commit.
verifyFetchedTarball has the effect of deserialising the index tarball (see the call to Sec.withIndex).

verifyFetchedTarball is called individually for each package in the build plan (see ProjectPlanning.hs), not once per repo.

The hackage tarball is now 880 MB, so it takes a not insignificant amount of time to deserialise it (much better after haskell/tar#95).

This code path is important: these 38 calls can add 1s to the initial load of a project, and the cost scales linearly with the size of your build tree.

Reproducer: a simple project with a "lens" dependency deserialises the index tarball 38 times.

Solution: refactor verifyFetchedTarball to run once per repo rather than once per package.

In future it would be much better to refactor this function so that the items are not immediately grouped and ungrouped, but I didn't want to take that on immediately.

Fixes haskell#10110
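The once-per-repo idea can be sketched as follows. This is not Cabal's code: Repo, PkgId, groupByRepo, verifyPerRepo and demo are all hypothetical names, and the verification action is passed in as a parameter so that the expensive per-repo step (deserialising the index, in the scenario above) runs once for each distinct repo.

```haskell
import Data.Functor.Identity (Identity (..))
import qualified Data.Map.Strict as Map

-- Hypothetical stand-ins for the real repository and package types.
type Repo  = String
type PkgId = String

-- | Collect the packages belonging to each repo, preserving input order
-- within a repo.
groupByRepo :: [(Repo, PkgId)] -> Map.Map Repo [PkgId]
groupByRepo = Map.fromListWith (flip (++)) . map (\(r, p) -> (r, [p]))

-- | Run the (expensive) verification action once per repo, handing it all
-- of that repo's packages at once, instead of once per package.
verifyPerRepo
  :: Applicative f
  => (Repo -> [PkgId] -> f a)
  -> [(Repo, PkgId)]
  -> f [a]
verifyPerRepo verify = traverse (uncurry verify) . Map.toList . groupByRepo

-- | Hypothetical usage: three packages in two repos trigger two calls.
demo :: [(Repo, Int)]
demo =
  runIdentity $
    verifyPerRepo
      (\repo pkgs -> Identity (repo, length pkgs))
      [("hackage", "lens"), ("local", "foo"), ("hackage", "mtl")]
```

With this shape the number of calls depends on the number of repos rather than the number of packages in the build plan.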
This is ready, @Bodigrim!
Benchmarks of aarch64 with GHC 9.6:
This is very impressive, thanks, guys!
Benchmarks of aarch64 with vs without the patch that makes GHC produce assembly for byteswap rather than a function call: Before
After
Released as
verifyFetchedTarball has the effect of deserialising the index tarball (see call to Sec.withIndex). verifyFetchedTarball is called individually for each package in the build plan (see ProjectPlanning.hs). Not once per repo. The hackage tarball is now 880mb so it takes a non significant amount of time to deserialise this (much better after haskell/tar#95). This code path is important as it can add 1s with these 38 calls on to the initial load of a project and scales linearly with the size of your build tree. Reproducer: Simple project with "lens" dependency deserialises the index tarball 38 times. Solution: Refactor verifyFetchedTarball to run once per repo rather than once per package. In future it would be much better to refactor this function so that the items are not immediately grouped and ungrouped but I didn't want to take that on immediately. Fixes #10110 (cherry picked from commit 7d46115) Co-authored-by: Matthew Pickering <matthewtpickering@gmail.com>
Use the lower-level array-construction primitives to avoid intermediate allocations and perform much better at deserialisation.
On Cabal (which uses tar for the hackage index), we observed: