Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

foldl' doens't inline on basic example #329

Closed
Boarders opened this issue Nov 25, 2020 · 4 comments · Fixed by #345
Closed

foldl' doens't inline on basic example #329

Boarders opened this issue Nov 25, 2020 · 4 comments · Fixed by #345
Milestone

Comments

@Boarders
Copy link
Contributor

I have the following code:

countBracketsBA :: ByteArray -> Int
countBracketsBA = foldlBA addBA 0
  where
    addBA :: Int -> Word8 -> Int
    addBA acc 40 = acc + 1
    addBA acc 41 = acc - 1
    addBA acc _  = acc

foldlBA :: forall a b. (Prim a) => (b -> a -> b) -> b -> ByteArray -> b
foldlBA f z arr = go 0 z
  where
    go i acc
      | i < maxI  = go (i + 1) (f acc (indexByteArray arr i)) 
      | otherwise = acc
    maxI = sizeofByteArray arr `quot` sizeOf (undefined :: a)

countBrackets :: ByteString -> Int
countBrackets = ByteString.foldl' addBS 0
  where
    addBS :: Int -> Word8 -> Int
    addBS acc 40 = acc + 1
    addBS acc 41 = acc - 1
    addBS acc _  = acc

I then run this with the following benchmark:

n :: Int
n = 100000

input :: [Word8]
input = replicate n 40 <> replicate n 41

getInputs :: IO (ByteArray.ByteArray, ByteString.ByteString)
getInputs =
  do
    let ba = ByteArray.byteArrayFromList input
    let bs = ByteString.pack input
    pure (ba, bs)

main :: IO ()
main = do
  defaultMain . pure $
    env getInputs $ \ ~(bytes, bs) ->
      bgroup "bracket count"
        [ bench "BA" $ nf countBracketsBA bytes
        , bench "BS" $ nf countBrackets   bs       
        ]

This gives the following result:

benchmarking bracket count/BA
time                 259.7 μs   (257.4 μs .. 261.8 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 258.9 μs   (257.5 μs .. 260.7 μs)
std dev              5.351 μs   (3.942 μs .. 8.183 μs)
variance introduced by outliers: 14% (moderately inflated)

benchmarking bracket count/BS
time                 1.302 ms   (1.265 ms .. 1.339 ms)
                     0.996 R²   (0.995 R² .. 0.999 R²)
mean                 1.263 ms   (1.252 ms .. 1.279 ms)
std dev              45.81 μs   (32.79 μs .. 63.27 μs)
variance introduced by outliers: 25% (moderately inflated)

Lyxia (in the fp slack) informed me that this is because foldl' was not being inlined and indeed if I add a definition in the same module (due to Lyxia):

myfoldl' :: (a -> Word8 -> a) -> a -> ByteString -> a
myfoldl' f v (PS fp off len) =
      accursedUnutterablePerformIO $ withForeignPtr fp $ \p ->
        go v (p `plusPtr` off) (p `plusPtr` (off+len))
    where
      -- tail recursive; traverses array left to right
      go !z !p !q | p == q    = return z
                  | otherwise = do x <- peek p
                                   go (f z x) (p `plusPtr` 1) q

then I get good performance:

benchmarking bracket count/BS inline
time                 177.8 μs   (176.4 μs .. 179.8 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 178.8 μs   (177.6 μs .. 182.9 μs)
std dev              6.526 μs   (2.579 μs .. 12.89 μs)
variance introduced by outliers: 34% (moderately inflated)

I'm not quite sure if this is an issue with GHC or bytestring but it doesn't seem ideal to have such behavior. This was tested with bytestring-0.10.10.1 and ghc-8.8.4 so I don't know if the same issue appears in the latest release nor the latest GHC.

@Bodigrim
Copy link
Contributor

Probably similar to #23. A PR rewriting fold{l,r}{,'} to the form foldl' f v = \(BS fp len) -> ... would be appreciated.

@Boarders
Copy link
Contributor Author

I'll check that gives the desired performance and then give it a go.

@Boarders
Copy link
Contributor Author

I found that the eta-expansion does indeed work but surprisingly the best performing version resulted from simply removing the INLINE pragma, I wonder if any experimentation has been done with that idea in bytestring?

@Bodigrim
Copy link
Contributor

I have not experimented much with inlining in bytestring. It would be interesting to compare Core output for countBrackets.

{-# OPTIONS_GHC -ddump-simpl -dsuppress-all -dno-suppress-type-signatures #-}

@Bodigrim Bodigrim linked a pull request Jan 13, 2021 that will close this issue
@Bodigrim Bodigrim added this to the 0.11.2.0 milestone Aug 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants