Try optimizing loops in Data.ByteString.findIndex[End] in the style of #273 #338

Closed
sjakobi opened this issue Dec 19, 2020 · 14 comments · Fixed by #347

@sjakobi (Member) commented Dec 19, 2020

In #273 we optimized the inner loops of several functions by floating out a static argument. findIndex and findIndexEnd have a format similar to that of the optimized functions, so it would be good to check whether they can be improved in the same way. findIndexOrEnd is probably the most similar of the already-optimized functions.

bytestring/Data/ByteString.hs, lines 1340 to 1366 at 8c631df:

-- | /O(n)/ The 'findIndex' function takes a predicate and a 'ByteString' and
-- returns the index of the first element in the ByteString
-- satisfying the predicate.
findIndex :: (Word8 -> Bool) -> ByteString -> Maybe Int
findIndex k (BS x l) = accursedUnutterablePerformIO $ withForeignPtr x $ \f -> go f 0
  where
    go !ptr !n | n >= l    = return Nothing
               | otherwise = do w <- peek ptr
                                if k w
                                  then return (Just n)
                                  else go (ptr `plusPtr` 1) (n+1)
{-# INLINE [1] findIndex #-}

-- | /O(n)/ The 'findIndexEnd' function takes a predicate and a 'ByteString' and
-- returns the index of the last element in the ByteString
-- satisfying the predicate.
--
-- @since 0.10.12.0
findIndexEnd :: (Word8 -> Bool) -> ByteString -> Maybe Int
findIndexEnd k (BS x l) = accursedUnutterablePerformIO $ withForeignPtr x $ \f -> go f (l-1)
  where
    go !ptr !n | n < 0     = return Nothing
               | otherwise = do w <- peekByteOff ptr n
                                if k w
                                  then return (Just n)
                                  else go ptr (n-1)
{-# INLINE findIndexEnd #-}

There might be even more functions that could be optimized in a similar way.
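
For anyone unfamiliar with the technique, here is a minimal sketch of the transformation (loopBefore/loopAfter are illustrative names, not code from #273):

{-# LANGUAGE BangPatterns #-}
import Data.Word (Word8)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (peek, peekByteOff)

-- Before: the loop threads the pointer through every recursive call,
-- even though the base pointer and length never change.
loopBefore :: (Word8 -> Bool) -> Ptr Word8 -> Int -> IO (Maybe Int)
loopBefore k base l = go base 0
  where
    go !ptr !n
      | n >= l    = return Nothing
      | otherwise = do
          w <- peek ptr
          if k w then return (Just n) else go (ptr `plusPtr` 1) (n + 1)

-- After: the static data (base pointer and length) is captured by the
-- enclosing closure, so the loop carries only the index. GHC can then
-- compile `go` as a join point, i.e. a plain jump with one argument.
loopAfter :: (Word8 -> Bool) -> Ptr Word8 -> Int -> IO (Maybe Int)
loopAfter k base l = go 0
  where
    go !n
      | n >= l    = return Nothing
      | otherwise = do
          w <- peekByteOff base n
          if k w then return (Just n) else go (n + 1)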

@Boarders (Contributor) commented

Just tried this out on findIndexEnd using the definition:

findIndexEnd :: (Word8 -> Bool) -> ByteString -> Maybe Int
findIndexEnd k (BS x l) = accursedUnutterablePerformIO $ withForeignPtr x $
  \fp ->
    let
      go !n | n < 0     = return Nothing
            | otherwise = do w <- peekByteOff fp n
                             if k w
                               then return (Just n)
                               else go (n-1)
    in
      go (l-1)
{-# INLINE findIndexEnd #-}

On the findIndexOrEnd benchmarks I get the following numbers:
Old:

benchmarked Data.ByteString.Builder/findIndexOrEnd/takeWhile
time                 43.82 μs   (41.57 μs .. 45.92 μs)
                     0.983 R²   (0.972 R² .. 0.992 R²)
mean                 40.33 μs   (39.63 μs .. 41.40 μs)
std dev              2.805 μs   (2.197 μs .. 3.581 μs)
variance introduced by outliers: 44% (moderately inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/dropWhile
time                 50.56 μs   (46.88 μs .. 53.94 μs)
                     0.965 R²   (0.946 R² .. 0.982 R²)
mean                 50.34 μs   (48.86 μs .. 52.13 μs)
std dev              5.332 μs   (4.500 μs .. 6.252 μs)
variance introduced by outliers: 65% (severely inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/break
time                 53.73 μs   (47.22 μs .. 60.65 μs)
                     0.946 R²   (0.920 R² .. 0.982 R²)
mean                 45.18 μs   (43.99 μs .. 47.19 μs)
std dev              4.975 μs   (3.247 μs .. 8.563 μs)
variance introduced by outliers: 65% (severely inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/group zeroes
time                 10.68 μs   (10.22 μs .. 11.17 μs)
                     0.986 R²   (0.979 R² .. 0.992 R²)
mean                 10.81 μs   (10.59 μs .. 11.03 μs)
std dev              760.1 ns   (612.6 ns .. 955.9 ns)
variance introduced by outliers: 46% (moderately inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/group zero-one
time                 195.7 μs   (177.0 μs .. 214.3 μs)
                     0.904 R²   (0.809 R² .. 0.960 R²)
mean                 271.9 μs   (255.3 μs .. 299.1 μs)
std dev              69.81 μs   (49.52 μs .. 112.3 μs)
variance introduced by outliers: 92% (severely inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/groupBy (>=)
time                 168.5 μs   (158.0 μs .. 178.7 μs)
                     0.965 R²   (0.935 R² .. 0.984 R²)
mean                 169.7 μs   (164.3 μs .. 178.6 μs)
std dev              21.76 μs   (14.94 μs .. 34.62 μs)
variance introduced by outliers: 73% (severely inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/groupBy (>)
time                 411.0 μs   (391.5 μs .. 426.7 μs)
                     0.976 R²   (0.950 R² .. 0.992 R²)
mean                 457.0 μs   (444.2 μs .. 480.7 μs)
std dev              58.38 μs   (30.23 μs .. 98.57 μs)
variance introduced by outliers: 73% (severely inflated)

New:

 lengths of input data: [10000,10000,10000,10000,10000,10000,10000]
benchmarked Data.ByteString.Builder/findIndexOrEnd/takeWhile
time                 30.86 μs   (29.80 μs .. 31.88 μs)
                     0.992 R²   (0.987 R² .. 0.995 R²)
mean                 34.69 μs   (33.96 μs .. 35.63 μs)
std dev              2.791 μs   (2.349 μs .. 3.369 μs)
variance introduced by outliers: 52% (severely inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/dropWhile
time                 42.38 μs   (42.20 μs .. 42.58 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 42.43 μs   (42.33 μs .. 42.57 μs)
std dev              387.7 ns   (297.9 ns .. 530.9 ns)

benchmarked Data.ByteString.Builder/findIndexOrEnd/break
time                 47.65 μs   (47.45 μs .. 47.86 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 47.73 μs   (47.60 μs .. 47.84 μs)
std dev              412.0 ns   (329.8 ns .. 558.9 ns)

benchmarked Data.ByteString.Builder/findIndexOrEnd/group zeroes
time                 5.606 μs   (5.490 μs .. 5.718 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 5.475 μs   (5.445 μs .. 5.512 μs)
std dev              112.7 ns   (89.01 ns .. 137.1 ns)

benchmarked Data.ByteString.Builder/findIndexOrEnd/group zero-one
time                 198.3 μs   (195.1 μs .. 201.1 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 199.9 μs   (198.8 μs .. 201.0 μs)
std dev              3.806 μs   (3.205 μs .. 4.631 μs)

benchmarked Data.ByteString.Builder/findIndexOrEnd/groupBy (>=)
time                 149.8 μs   (143.5 μs .. 156.7 μs)
                     0.991 R²   (0.987 R² .. 0.996 R²)
mean                 143.3 μs   (141.3 μs .. 145.4 μs)
std dev              7.011 μs   (5.880 μs .. 8.069 μs)
variance introduced by outliers: 27% (moderately inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/groupBy (>)
time                 395.6 μs   (388.0 μs .. 401.6 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 392.0 μs   (390.1 μs .. 393.9 μs)
std dev              6.165 μs   (5.336 μs .. 7.521 μs)

I haven't yet experimented with a similar technique for findIndex, though I'm also unsure where I would benchmark it. I can make a PR later this afternoon if that's worthwhile.

@Bodigrim (Contributor) commented

@Boarders if you run benchmarks with --short, reports will be less verbose and easier to compare.

Feel free to throw new benchmarks into BenchAll.hs.

A PR would be much appreciated.
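
For example, a new entry might look like this (a sketch, assuming the gauge framework that BenchAll.hs used at the time; the group name and inputs are illustrative):

import qualified Data.ByteString as BS
import Gauge.Main (bench, bgroup, defaultMain, nf)

main :: IO ()
main = defaultMain
  [ bgroup "findIndexEnd"
      [ -- Worst case: scan the whole string without finding a match.
        bench "no match"  $ nf (BS.findIndexEnd (== 0)) (BS.replicate 10000 1)
        -- Best case: the last byte already matches.
      , bench "last byte" $ nf (BS.findIndexEnd (== 1)) (BS.replicate 10000 1)
      ]
  ]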

@Boarders (Contributor) commented

In the above I had my performance numbers backwards (the old and new results were swapped). I tried out:

findIndexEnd :: (Word8 -> Bool) -> ByteString -> Maybe Int
findIndexEnd k (BS x l) = accursedUnutterablePerformIO $ withForeignPtr x $
  \fp ->
    let
      go !n | n < 0     = return Nothing
            | otherwise = do w <- peekByteOff fp n
                             if k w
                               then return (Just n)
                               else go (n-1)
    in
      go (l-1)
{-# INLINE findIndexEnd #-}

as well as this pointer-based variant:

findIndexEnd :: (Word8 -> Bool) -> ByteString -> Maybe Int
findIndexEnd k (BS x l) = accursedUnutterablePerformIO $ withForeignPtr x $ \fp ->
  let
    start = fp `plusPtr` (l - 1)
    end   = fp `plusPtr` (-1)
    go !ptr | ptr == end = return Nothing
            | otherwise  = do w <- peek ptr
                              if k w
                                then return (Just ((ptr `minusPtr` end) - 1))
                                else go (ptr `plusPtr` (-1))
  in
    go start
{-# INLINE findIndexEnd #-}

and the same sort of variants on findIndex. None led to a performance improvement for me.

@sjakobi (Member, Author) commented Jan 14, 2021

None led to a performance improvement for me.

@Boarders Could it be that the benchmarks you're looking at simply don't exercise the findIndexEnd function you're changing? Above you're showing results for findIndexOrEnd (note the Or), but that is a different function from findIndexEnd.
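
For context, the two functions have different result types (signatures as in Data.ByteString; findIndexOrEnd is an internal helper, not part of the public API):

-- Exported: index of the last matching byte, or Nothing.
findIndexEnd :: (Word8 -> Bool) -> ByteString -> Maybe Int

-- Internal helper: index of the first matching byte, or the length
-- of the string if no byte matches.
findIndexOrEnd :: (Word8 -> Bool) -> ByteString -> Int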

@Boarders (Contributor) commented

Ah ok, I am dumb ( :D ). Let me check again!

@sjakobi (Member, Author) commented Jan 14, 2021

No worries, @Boarders! The naming is definitely confusing. We could consider changing the findIndexOrEnd name to avoid similar situations in the future.

@Boarders (Contributor) commented

I wrote some (not excellent) benchmarks, and this change does improve things.

Old:

findIndex1/findIndices                   mean 558.7 ns  ( +- 12.44 ns  )
findIndex1/find                          mean 49.60 ns  ( +- 943.7 ps  )
findIndex1End/findIndexEnd               mean 9.980 ns  ( +- 345.4 ps  )
findIndex1End/elemIndexInd               mean 558.2 ns  ( +- 7.707 ns  )

New:

findIndex/findIndices                   mean 564.7 ns  ( +- 14.05 ns  )
findIndex/find                          mean 34.50 ns  ( +- 321.5 ps  )
findIndexEnd/findIndexEnd               mean 8.599 ns  ( +- 113.2 ps  )
findIndexEnd/elemIndexInd               mean 477.9 ns  ( +- 8.306 ns  )

I'll see if there are any other obvious functions like this and then make a PR.

@Boarders (Contributor) commented

I just discovered that even map benefits from this sort of transformation!

Changing:

map :: (Word8 -> Word8) -> ByteString -> ByteString
map f (BS fp len) = unsafeDupablePerformIO $ withForeignPtr fp $ \a ->
    create len $ map_ 0 a
  where
    map_ :: Int -> Ptr Word8 -> Ptr Word8 -> IO ()
    map_ !n !p1 !p2
       | n >= len = return ()
       | otherwise = do
            x <- peekByteOff p1 n
            pokeByteOff p2 n (f x)
            map_ (n+1) p1 p2
{-# INLINE map #-}

to

map :: (Word8 -> Word8) -> ByteString -> ByteString
map f (BS fp len) = unsafeDupablePerformIO $ unsafeWithForeignPtr fp $ \ptr1 ->
    create len $ \ptr2 -> m ptr1 ptr2
  where
    m p1 p2 = map_ 0
      where
        map_ :: Int -> IO ()
        map_ !n
          | n >= len  = return ()
          | otherwise = do
              x <- peekByteOff p1 n
              pokeByteOff p2 n (f x)
              map_ (n+1)
{-# INLINE map #-}

Gave me this on a benchmark:

old:
traversal/map                            mean 42.51 μs  ( +- 716.8 ns  )
new:
traversal/map                            mean 26.10 μs  ( +- 361.0 ns  )

@Bodigrim (Contributor) commented

@Boarders Awesome! Please check that there is no regression for shorter strings; this kind of optimization can potentially lead to a (constant) increase in heap allocations. If you could accompany your PR with Core dumps of the modified functions before and after, it would help the review a lot. If the new Core contains join or joinrec, that is a good sign; if let or letrec, not so much. See the example at #273 (comment) and below.
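
For anyone wanting to reproduce such dumps, these are standard GHC flags (the exact invocation depends on your build setup; with cabal you would pass them via --ghc-options, adding -ddump-to-file to write the dump next to the object file):

ghc -O2 -fforce-recomp -ddump-simpl -dsuppress-all -dsuppress-uniques Data/ByteString.hs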

@Boarders (Contributor) commented

Just quickly checked: it is faster (by a reasonable margin) on every example I have tried. I'll see what I can do about getting some Core; if you have a recommended place to put it, that would help.

@sjakobi (Member, Author) commented Jan 16, 2021

The naming is definitely confusing. We could consider changing the findIndexOrEnd name to avoid similar situations in the future.

#348 implements the renaming.

@Bodigrim Bodigrim linked a pull request Jan 16, 2021 that will close this issue
@Bodigrim Bodigrim added this to the 0.11.1.0 milestone Jan 16, 2021
@sjakobi (Member, Author) commented Jan 16, 2021

@Boarders, for completeness: did you check whether there are any more inner loops left where we might be able to reduce the number of arguments?

@Boarders (Contributor) commented Jan 18, 2021

@sjakobi: The only other example I came across is the new packZipWith, which is written like so:

packZipWith :: (Word8 -> Word8 -> Word8) -> ByteString -> ByteString -> ByteString
packZipWith f (BS fp l) (BS fq m) = unsafeDupablePerformIO $
    withForeignPtr fp $ \a ->
    withForeignPtr fq $ \b ->
    create len $ go a b
  where
    go p1 p2 = zipWith_ 0
      where
        zipWith_ :: Int -> Ptr Word8 -> IO ()
        zipWith_ !n !r
           | n >= len = return ()
           | otherwise = do
                x <- peekByteOff p1 n
                y <- peekByteOff p2 n
                pokeByteOff r n (f x y)
                zipWith_ (n+1) r

    len = min l m
{-# INLINE packZipWith #-}

That could be rewritten as:

packZipWith :: (Word8 -> Word8 -> Word8) -> ByteString -> ByteString -> ByteString
packZipWith f (BS fp l) (BS fq m) = unsafeDupablePerformIO $
    withForeignPtr fp $ \srcPtr1 ->
    withForeignPtr fq $ \srcPtr2 ->
    create len $ \destPtr -> go srcPtr1 srcPtr2 destPtr
  where
    go p1 p2 dest = zipWith_ 0
      where
        zipWith_ :: Int -> IO ()
        zipWith_ !n
           | n >= len = return ()
           | otherwise = do
                x <- peekByteOff p1 n
                y <- peekByteOff p2 n
                pokeByteOff dest n (f x y)
                zipWith_ (n+1)

    len = min l m
{-# INLINE packZipWith #-}

This one leads to better performance when not inlined (26.63 μs vs 37.06 μs) but worse performance when inlined (9.011 μs vs 5.940 μs). That is mysterious to me, so I thought it better to leave this change out; if someone wishes to investigate further, it might be interesting to do so.
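
One way to reproduce the inlined/non-inlined comparison is to benchmark through a NOINLINE wrapper (a sketch; packZipWithNI is a hypothetical name, the harness is again assumed to be gauge, and Data.ByteString.packZipWith is assumed to be exported, as in the development tree at the time):

import qualified Data.ByteString as BS
import Data.Word (Word8)
import Gauge.Main (bench, bgroup, defaultMain, nf)

-- Hypothetical wrapper: NOINLINE stops GHC from specialising the call
-- to the concrete combining function at the benchmark site.
packZipWithNI :: (Word8 -> Word8 -> Word8) -> BS.ByteString -> BS.ByteString -> BS.ByteString
packZipWithNI = BS.packZipWith
{-# NOINLINE packZipWithNI #-}

main :: IO ()
main = defaultMain
  [ bgroup "packZipWith"
      [ bench "inlined"     $ nf (BS.packZipWith (+) xs) ys
      , bench "not inlined" $ nf (packZipWithNI (+) xs) ys
      ]
  ]
  where
    xs = BS.replicate 10000 1
    ys = BS.replicate 10000 2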

@sjakobi (Member, Author) commented Jan 18, 2021

@Boarders Indeed, it would be nice to get a better understanding of the performance implications here. I've included packZipWith as a TODO item in #350.
