Try optimizing loops in Data.ByteString.findIndex[End] in the style of #273 #338

Closed
sjakobi opened this issue Dec 19, 2020 · 14 comments · Fixed by #347

@sjakobi (Member) commented Dec 19, 2020

In #273 we optimized the inner loops of several functions by floating out a static argument. findIndex and findIndexEnd have a format similar to that of the optimized functions, so it would be good to check whether they can be improved in the same way. findIndexOrEnd is probably the most similar of the already-optimized functions.

bytestring/Data/ByteString.hs, lines 1340 to 1366 at 8c631df:

-- | /O(n)/ The 'findIndex' function takes a predicate and a 'ByteString' and
-- returns the index of the first element in the ByteString
-- satisfying the predicate.
findIndex :: (Word8 -> Bool) -> ByteString -> Maybe Int
findIndex k (BS x l) = accursedUnutterablePerformIO $ withForeignPtr x $ \f -> go f 0
  where
    go !ptr !n | n >= l    = return Nothing
               | otherwise = do w <- peek ptr
                                if k w
                                  then return (Just n)
                                  else go (ptr `plusPtr` 1) (n+1)
{-# INLINE [1] findIndex #-}

-- | /O(n)/ The 'findIndexEnd' function takes a predicate and a 'ByteString' and
-- returns the index of the last element in the ByteString
-- satisfying the predicate.
--
-- @since 0.10.12.0
findIndexEnd :: (Word8 -> Bool) -> ByteString -> Maybe Int
findIndexEnd k (BS x l) = accursedUnutterablePerformIO $ withForeignPtr x $ \f -> go f (l-1)
  where
    go !ptr !n | n < 0     = return Nothing
               | otherwise = do w <- peekByteOff ptr n
                                if k w
                                  then return (Just n)
                                  else go ptr (n-1)
{-# INLINE findIndexEnd #-}

There might be even more functions that could be optimized in a similar way.
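
For anyone unfamiliar with the technique, here is a minimal sketch of the transformation (loopBefore/loopAfter are illustrative names, not code from #273):

{-# LANGUAGE BangPatterns #-}
import Data.Word (Word8)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (peek, peekByteOff)

-- Before: the loop threads the pointer through every recursive call,
-- even though the base pointer and length never change.
loopBefore :: (Word8 -> Bool) -> Ptr Word8 -> Int -> IO (Maybe Int)
loopBefore k base l = go base 0
  where
    go !ptr !n
      | n >= l    = return Nothing
      | otherwise = do
          w <- peek ptr
          if k w then return (Just n) else go (ptr `plusPtr` 1) (n + 1)

-- After: the static data (base pointer and length) is captured by the
-- enclosing closure, so the loop carries only the index. GHC can then
-- compile `go` as a join point, i.e. a plain jump with one argument.
loopAfter :: (Word8 -> Bool) -> Ptr Word8 -> Int -> IO (Maybe Int)
loopAfter k base l = go 0
  where
    go !n
      | n >= l    = return Nothing
      | otherwise = do
          w <- peekByteOff base n
          if k w then return (Just n) else go (n + 1)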

@Boarders (Contributor) commented

Just tried this out on findIndexEnd using the definition:

findIndexEnd :: (Word8 -> Bool) -> ByteString -> Maybe Int
findIndexEnd k (BS x l) = accursedUnutterablePerformIO $ withForeignPtr x $
  \fp ->
    let
      go !n | n < 0     = return Nothing
            | otherwise = do w <- peekByteOff fp n
                             if k w
                               then return (Just n)
                               else go (n-1)
    in
      go (l-1)
{-# INLINE findIndexEnd #-}

On the findIndexOrEnd benchmarks I get the following numbers:
Old:

benchmarked Data.ByteString.Builder/findIndexOrEnd/takeWhile
time                 43.82 μs   (41.57 μs .. 45.92 μs)
                     0.983 R²   (0.972 R² .. 0.992 R²)
mean                 40.33 μs   (39.63 μs .. 41.40 μs)
std dev              2.805 μs   (2.197 μs .. 3.581 μs)
variance introduced by outliers: 44% (moderately inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/dropWhile
time                 50.56 μs   (46.88 μs .. 53.94 μs)
                     0.965 R²   (0.946 R² .. 0.982 R²)
mean                 50.34 μs   (48.86 μs .. 52.13 μs)
std dev              5.332 μs   (4.500 μs .. 6.252 μs)
variance introduced by outliers: 65% (severely inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/break
time                 53.73 μs   (47.22 μs .. 60.65 μs)
                     0.946 R²   (0.920 R² .. 0.982 R²)
mean                 45.18 μs   (43.99 μs .. 47.19 μs)
std dev              4.975 μs   (3.247 μs .. 8.563 μs)
variance introduced by outliers: 65% (severely inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/group zeroes
time                 10.68 μs   (10.22 μs .. 11.17 μs)
                     0.986 R²   (0.979 R² .. 0.992 R²)
mean                 10.81 μs   (10.59 μs .. 11.03 μs)
std dev              760.1 ns   (612.6 ns .. 955.9 ns)
variance introduced by outliers: 46% (moderately inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/group zero-one
time                 195.7 μs   (177.0 μs .. 214.3 μs)
                     0.904 R²   (0.809 R² .. 0.960 R²)
mean                 271.9 μs   (255.3 μs .. 299.1 μs)
std dev              69.81 μs   (49.52 μs .. 112.3 μs)
variance introduced by outliers: 92% (severely inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/groupBy (>=)
time                 168.5 μs   (158.0 μs .. 178.7 μs)
                     0.965 R²   (0.935 R² .. 0.984 R²)
mean                 169.7 μs   (164.3 μs .. 178.6 μs)
std dev              21.76 μs   (14.94 μs .. 34.62 μs)
variance introduced by outliers: 73% (severely inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/groupBy (>)
time                 411.0 μs   (391.5 μs .. 426.7 μs)
                     0.976 R²   (0.950 R² .. 0.992 R²)
mean                 457.0 μs   (444.2 μs .. 480.7 μs)
std dev              58.38 μs   (30.23 μs .. 98.57 μs)
variance introduced by outliers: 73% (severely inflated)

New:

 lengths of input data: [10000,10000,10000,10000,10000,10000,10000]
benchmarked Data.ByteString.Builder/findIndexOrEnd/takeWhile
time                 30.86 μs   (29.80 μs .. 31.88 μs)
                     0.992 R²   (0.987 R² .. 0.995 R²)
mean                 34.69 μs   (33.96 μs .. 35.63 μs)
std dev              2.791 μs   (2.349 μs .. 3.369 μs)
variance introduced by outliers: 52% (severely inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/dropWhile
time                 42.38 μs   (42.20 μs .. 42.58 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 42.43 μs   (42.33 μs .. 42.57 μs)
std dev              387.7 ns   (297.9 ns .. 530.9 ns)

benchmarked Data.ByteString.Builder/findIndexOrEnd/break
time                 47.65 μs   (47.45 μs .. 47.86 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 47.73 μs   (47.60 μs .. 47.84 μs)
std dev              412.0 ns   (329.8 ns .. 558.9 ns)

benchmarked Data.ByteString.Builder/findIndexOrEnd/group zeroes
time                 5.606 μs   (5.490 μs .. 5.718 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 5.475 μs   (5.445 μs .. 5.512 μs)
std dev              112.7 ns   (89.01 ns .. 137.1 ns)

benchmarked Data.ByteString.Builder/findIndexOrEnd/group zero-one
time                 198.3 μs   (195.1 μs .. 201.1 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 199.9 μs   (198.8 μs .. 201.0 μs)
std dev              3.806 μs   (3.205 μs .. 4.631 μs)

benchmarked Data.ByteString.Builder/findIndexOrEnd/groupBy (>=)
time                 149.8 μs   (143.5 μs .. 156.7 μs)
                     0.991 R²   (0.987 R² .. 0.996 R²)
mean                 143.3 μs   (141.3 μs .. 145.4 μs)
std dev              7.011 μs   (5.880 μs .. 8.069 μs)
variance introduced by outliers: 27% (moderately inflated)

benchmarked Data.ByteString.Builder/findIndexOrEnd/groupBy (>)
time                 395.6 μs   (388.0 μs .. 401.6 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 392.0 μs   (390.1 μs .. 393.9 μs)
std dev              6.165 μs   (5.336 μs .. 7.521 μs)

I haven't yet experimented with a similar technique for findIndex, though I'm also unsure where I would benchmark it. I can make a PR later this afternoon if that's worthwhile.

@Bodigrim (Contributor) commented

@Boarders if you run benchmarks with --short, reports will be less verbose and easier to compare.

Feel free to throw new benchmarks into BenchAll.hs.

A PR would be much appreciated.
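
For example, a new entry might look like this (a sketch, assuming the gauge framework that BenchAll.hs used at the time; the group name and inputs are illustrative):

import qualified Data.ByteString as BS
import Gauge.Main (bench, bgroup, defaultMain, nf)

main :: IO ()
main = defaultMain
  [ bgroup "findIndexEnd"
      [ -- Worst case: scan the whole string without finding a match.
        bench "no match"  $ nf (BS.findIndexEnd (== 0)) (BS.replicate 10000 1)
        -- Best case: the last byte already matches.
      , bench "last byte" $ nf (BS.findIndexEnd (== 1)) (BS.replicate 10000 1)
      ]
  ]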

@Boarders (Contributor) commented

In the above I had my performance numbers backwards (the old and new results were swapped). I tried out:

findIndexEnd :: (Word8 -> Bool) -> ByteString -> Maybe Int
findIndexEnd k (BS x l) = accursedUnutterablePerformIO $ withForeignPtr x $
  \fp ->
    let
      go !n | n < 0     = return Nothing
            | otherwise = do w <- peekByteOff fp n
                             if k w
                               then return (Just n)
                               else go (n-1)
    in
      go (l-1)
{-# INLINE findIndexEnd #-}

as well as this pointer-based variant:

findIndexEnd :: (Word8 -> Bool) -> ByteString -> Maybe Int
findIndexEnd k (BS x l) = accursedUnutterablePerformIO $ withForeignPtr x $ \fp ->
  let
    start = fp `plusPtr` (l - 1)
    end   = fp `plusPtr` (-1)
    go !ptr | ptr == end = return Nothing
            | otherwise  = do w <- peek ptr
                              if k w
                                then return (Just ((ptr `minusPtr` end) - 1))
                                else go (ptr `plusPtr` (-1))
  in
    go start
{-# INLINE findIndexEnd #-}

and the same sort of variants on findIndex. None led to a performance improvement for me.

@sjakobi (Member, Author) commented Jan 14, 2021

None led to a performance improvement for me.

@Boarders Could it be that the benchmarks you're looking at simply don't exercise the findIndexEnd function you're changing? Above you're showing results for findIndexOrEnd (note the Or), but that is a different function from findIndexEnd.
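
For context, the two functions have different result types (signatures as in Data.ByteString; findIndexOrEnd is an internal helper, not part of the public API):

-- Exported: index of the last matching byte, or Nothing.
findIndexEnd :: (Word8 -> Bool) -> ByteString -> Maybe Int

-- Internal helper: index of the first matching byte, or the length
-- of the string if no byte matches.
findIndexOrEnd :: (Word8 -> Bool) -> ByteString -> Int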

@Boarders (Contributor) commented

Ah ok, I am dumb ( :D ). Let me check again!

@sjakobi (Member, Author) commented Jan 14, 2021

No worries, @Boarders! The naming is definitely confusing. We could consider changing the findIndexOrEnd name to avoid similar situations in the future.

@Boarders (Contributor) commented

I wrote some (not excellent) benchmarks, and this change does improve things.

Old:

findIndex1/findIndices                   mean 558.7 ns  ( +- 12.44 ns  )
findIndex1/find                          mean 49.60 ns  ( +- 943.7 ps  )
findIndex1End/findIndexEnd               mean 9.980 ns  ( +- 345.4 ps  )
findIndex1End/elemIndexInd               mean 558.2 ns  ( +- 7.707 ns  )

New:

findIndex/findIndices                   mean 564.7 ns  ( +- 14.05 ns  )
findIndex/find                          mean 34.50 ns  ( +- 321.5 ps  )
findIndexEnd/findIndexEnd               mean 8.599 ns  ( +- 113.2 ps  )
findIndexEnd/elemIndexInd               mean 477.9 ns  ( +- 8.306 ns  )

I'll see if there are any other obvious functions like this and then make a PR.

@Boarders (Contributor) commented

I just discovered that even map benefits from this sort of transformation!

Changing:

map :: (Word8 -> Word8) -> ByteString -> ByteString
map f (BS fp len) = unsafeDupablePerformIO $ withForeignPtr fp $ \a ->
    create len $ map_ 0 a
  where
    map_ :: Int -> Ptr Word8 -> Ptr Word8 -> IO ()
    map_ !n !p1 !p2
       | n >= len = return ()
       | otherwise = do
            x <- peekByteOff p1 n
            pokeByteOff p2 n (f x)
            map_ (n+1) p1 p2
{-# INLINE map #-}

to

map :: (Word8 -> Word8) -> ByteString -> ByteString
map f (BS fp len) = unsafeDupablePerformIO $ unsafeWithForeignPtr fp $ \ptr1 ->
    create len $ \ptr2 -> m ptr1 ptr2
  where
    m p1 p2 = map_ 0
      where
        map_ :: Int -> IO ()
        map_ !n
          | n >= len  = return ()
          | otherwise = do
              x <- peekByteOff p1 n
              pokeByteOff p2 n (f x)
              map_ (n+1)
{-# INLINE map #-}

Gave me this on a benchmark:

old:
traversal/map                            mean 42.51 μs  ( +- 716.8 ns  )
new:
traversal/map                            mean 26.10 μs  ( +- 361.0 ns  )

@Bodigrim (Contributor) commented

@Boarders Awesome! Please check that there is no regression for shorter strings; this kind of optimization can potentially lead to a (constant) increase in heap allocations. If you could accompany your PR with Core dumps of the modified functions before and after, it would help the review a lot. If the new Core contains join or joinrec, that is a good sign; if let or letrec, not so much. See the example at #273 (comment) and below.
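
For anyone wanting to reproduce such dumps, these are standard GHC flags (the exact invocation depends on your build setup; with cabal you would pass them via --ghc-options, adding -ddump-to-file to write the dump next to the object file):

ghc -O2 -fforce-recomp -ddump-simpl -dsuppress-all -dsuppress-uniques Data/ByteString.hs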

@Boarders (Contributor) commented

Just quickly checked: it is faster (by a reasonable margin) on every example I have tried. I'll see what I can do about getting some Core; if you have a recommended place to put it, that would help.

@sjakobi (Member, Author) commented Jan 16, 2021

The naming is definitely confusing. We could consider changing the findIndexOrEnd name to avoid similar situations in the future.

#348 implements the renaming.

@Bodigrim Bodigrim linked a pull request Jan 16, 2021 that will close this issue
@Bodigrim Bodigrim added this to the 0.11.1.0 milestone Jan 16, 2021
@sjakobi (Member, Author) commented Jan 16, 2021

@Boarders, for completeness: did you check whether there are any more inner loops left where we might be able to reduce the number of arguments?

@Boarders (Contributor) commented Jan 18, 2021

@sjakobi: The only other example I came across is the new packZipWith, which is written like so:

packZipWith :: (Word8 -> Word8 -> Word8) -> ByteString -> ByteString -> ByteString
packZipWith f (BS fp l) (BS fq m) = unsafeDupablePerformIO $
    withForeignPtr fp $ \a ->
    withForeignPtr fq $ \b ->
    create len $ go a b
  where
    go p1 p2 = zipWith_ 0
      where
        zipWith_ :: Int -> Ptr Word8 -> IO ()
        zipWith_ !n !r
           | n >= len = return ()
           | otherwise = do
                x <- peekByteOff p1 n
                y <- peekByteOff p2 n
                pokeByteOff r n (f x y)
                zipWith_ (n+1) r

    len = min l m
{-# INLINE packZipWith #-}

That could be rewritten as:

packZipWith :: (Word8 -> Word8 -> Word8) -> ByteString -> ByteString -> ByteString
packZipWith f (BS fp l) (BS fq m) = unsafeDupablePerformIO $
    withForeignPtr fp $ \srcPtr1 ->
    withForeignPtr fq $ \srcPtr2 ->
    create len $ \destPtr -> go srcPtr1 srcPtr2 destPtr
  where
    go p1 p2 dest = zipWith_ 0
      where
        zipWith_ :: Int -> IO ()
        zipWith_ !n
           | n >= len = return ()
           | otherwise = do
                x <- peekByteOff p1 n
                y <- peekByteOff p2 n
                pokeByteOff dest n (f x y)
                zipWith_ (n+1)

    len = min l m
{-# INLINE packZipWith #-}

This one leads to better performance when not inlined (26.63 μs vs 37.06 μs) but worse performance when inlined (9.011 μs vs 5.940 μs). That is mysterious to me, so I thought it better to leave this change out; if someone wishes to investigate further, it might be interesting to do so.
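
One way to reproduce the inlined/non-inlined comparison is to benchmark through a NOINLINE wrapper (a sketch; packZipWithNI is a hypothetical name, the harness is again assumed to be gauge, and Data.ByteString.packZipWith is assumed to be exported, as in the development tree at the time):

import qualified Data.ByteString as BS
import Data.Word (Word8)
import Gauge.Main (bench, bgroup, defaultMain, nf)

-- Hypothetical wrapper: NOINLINE stops GHC from specialising the call
-- to the concrete combining function at the benchmark site.
packZipWithNI :: (Word8 -> Word8 -> Word8) -> BS.ByteString -> BS.ByteString -> BS.ByteString
packZipWithNI = BS.packZipWith
{-# NOINLINE packZipWithNI #-}

main :: IO ()
main = defaultMain
  [ bgroup "packZipWith"
      [ bench "inlined"     $ nf (BS.packZipWith (+) xs) ys
      , bench "not inlined" $ nf (packZipWithNI (+) xs) ys
      ]
  ]
  where
    xs = BS.replicate 10000 1
    ys = BS.replicate 10000 2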

@sjakobi (Member, Author) commented Jan 18, 2021

@Boarders Indeed, it would be nice to get a better understanding of the performance implications here. I've included packZipWith as a TODO item in #350.
