replaces findIndices with elemIndices #15

archaephyrryx · 2020-08-26T21:38:30Z

findIndices vastly inefficient as currently implemented in bytestring
library

elemIndices is much faster for finding only newlines

should resolve #14

findIndices vastly inefficient as currently implemented in bytestring library elemIndices is much faster for finding only newlines

Data/ByteString/Streaming/Char8.hs

Modifies internal logic of Data.ByteString.Streaming.Char8.lineSplit loops to call Data.ByteString.count only once for any given chunk, and repeated decrement its value by the number of lines consumed per iteration over the same chunk. This change should improve the overall performance of `lineSplit n` when n is small, by avoiding quadratic O(m^2/n) performance resulting from calling `B.count` every pass over each individual chunk.

vdukhovni · 2020-08-29T04:37:46Z

TL;DR: THANKS! This PR is substantial progress, and never does worse and often does much better than the head master commit. I think that perhaps the next step should be taken, and the B.count code removed entirely, at least until it can be improved and shown to actually deliver speedups in realistic use cases.

I made some measurements of the resulting performance on the head commit of the master branch, this PR (15), and a version that (as proposed) entirely does away with calling B.count (the "No Count" columns below). The measurements were made with lineSplit counts of 1, 256, 4096, 32768 and 524288 (16 x 32768) lines, and for a "normal" file with ~10m lines averaging 126 bytes and a worst-case file consisting of just empty lines. The results are below:

Time to read and output to /dev/null 10million lines of ~126 byte lines, split
into chunks of 1, 256, 4096, 32768 and 524288 lines via lineSplit.

SplitCount	Release	PR#15	No Count
1	131.331	4.450	3.630
256	9.480	1.651	0.771
4096	1.941	1.421	0.711
32768	1.481	1.390	0.700
524288	1.421	1.400	0.690

Time to read and output to /dev/null 10million empty lines, split into chunks
of 1, 256, 4096, 32768 and 524288 lines (i.e. bytes) via lineSplit.

SplitCount	Release	PR#15	No Count
1	111.140	1.661	1.580
256	0.770	0.300	0.170
4096	0.320	0.280	0.160
32768	0.120	0.110	0.160
524288	0.100	0.101	0.160

No major surprises, the PR#15 code and original code do better on a worst-case stream of just empty lines, but only when the requested line cluster size is as big or bigger than the streaming bytestring chunk size, in which case they're able to take advantage of the call to B.count to save some work.

My instinct is that twice the performance on typical files with longer lines compares favourably against 1.6x the performance on a stream of empty lines seen only when the requested line grouping ~32k lines at a time or more!

For the record, here's the benchmark code:

module Main (main) where

import qualified Streaming as S
import qualified Data.ByteString.Streaming.Char8 as Q
import qualified Data.List as L
import System.Environment (getArgs)

defaultLineCount :: Int
defaultLineCount = 1

main :: IO ()
main = do
    n <- maybe defaultLineCount (read . fst) <$> L.uncons <$> getArgs
    S.mapsM_ Q.putStr
        $ Q.lineSplit n
        $ Q.stdin

At this point feedback from the maintainers, et. al., would be great.

Cc: @Bodigrim, @cartazio, @sjakobi, @andrewthad

vdukhovni · 2020-08-31T10:42:05Z

Anyone?

andrewthad · 2020-08-31T12:40:14Z

I'm not involved in the maintenance of this library anymore. That aside, I think that this seems like a good improvement. The test suite has a few tests for splitting on newlines, so if it's still passing, the new implementation is probably correct.

vdukhovni · 2020-08-31T22:29:49Z

I'm not involved in the maintenance of this library anymore. That aside, I think that this seems like a good improvement. The test suite has a few tests for splitting on newlines, so if it's still passing, the new implementation is probably correct.

Who are the current (active?) maintainers? It would be nice to see this resolved. The key question is whether to go with this PR as-is, or to take a further step in the direction of eliminating the line count entirely, yielding broadly another ~2x performance improvement, at the cost of some loss of performance on vast streams of just newlines consumed with split counts larger than the number of lines (i.e. bytes) per chunk.

My take is that the "No Count" variant is the logical next step.

vdukhovni · 2020-09-01T00:57:45Z

Cc: @chessai, @fosskers any feedback on this PR? It seems the package has been stable and not in need of any maintenance for some time, which is often a feature (stable, well tested code, ...), but it would still be nice to get some dusty corners attended to...

fosskers · 2020-09-04T21:53:17Z

Hi @vdukhovni , sorry that it took a while to respond. I like where this PR is going, and think the No Count variant is the direction we should move in. As @andrewthad said, if the tests are passing and your benches show good numbers, then it should be an easy merge.

It can make a Hackage release as soon as all that is ready.

vdukhovni · 2020-09-05T04:59:33Z

Hi @vdukhovni , sorry that it took a while to respond. I like where this PR is going, and think the No Count variant is the direction we should move in. As @andrewthad said, if the tests are passing and your benches show good numbers, then it should be an easy merge.

It can make a Hackage release as soon as all that is ready.

Done in #18

I think the tests however could use some upkeep. They're using rather dated stack snapshots, and I think cabal would be a better choice, and would ideally test GHC 8.0, 8.2, 8.4, 8.6, 8.8 and 8.10 and soon also 9.0, as well as any 7.x versions still supported, ...
But otherwise, and perhaps with less effort a more comprehensive set of stack LTS snaphots and a more recent nightly.

fosskers · 2020-09-06T22:10:46Z

Oh it looks like I haven't given this repo the usual multi-LTS Github CI setup I use. I'll do that.

archaephyrryx · 2020-09-07T20:14:36Z

As #18 has already been merged, and this PR is a variant solution to the same problem addressed in that PR, I am closing both this PR and the associated issue (assuming the latter has not already been resolved).

replaces findIndices with elemIndices

589e724

findIndices vastly inefficient as currently implemented in bytestring library elemIndices is much faster for finding only newlines

Bodigrim mentioned this pull request Aug 26, 2020

FindIndices optimized using findIndex and inlining haskell/bytestring#270

Merged

vdukhovni suggested changes Aug 27, 2020

View reviewed changes

Data/ByteString/Streaming/Char8.hs Outdated Show resolved Hide resolved

vdukhovni mentioned this pull request Sep 5, 2020

Improve lineSpit performance. #18

Merged

archaephyrryx closed this Sep 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

replaces findIndices with elemIndices #15

replaces findIndices with elemIndices #15

archaephyrryx commented Aug 26, 2020

vdukhovni commented Aug 29, 2020 •

edited

Loading

vdukhovni commented Aug 31, 2020

andrewthad commented Aug 31, 2020

vdukhovni commented Aug 31, 2020

vdukhovni commented Sep 1, 2020

fosskers commented Sep 4, 2020 •

edited

Loading

vdukhovni commented Sep 5, 2020

fosskers commented Sep 6, 2020 •

edited

Loading

archaephyrryx commented Sep 7, 2020

replaces findIndices with elemIndices #15

replaces findIndices with elemIndices #15

Conversation

archaephyrryx commented Aug 26, 2020

vdukhovni commented Aug 29, 2020 • edited Loading

vdukhovni commented Aug 31, 2020

andrewthad commented Aug 31, 2020

vdukhovni commented Aug 31, 2020

vdukhovni commented Sep 1, 2020

fosskers commented Sep 4, 2020 • edited Loading

vdukhovni commented Sep 5, 2020

fosskers commented Sep 6, 2020 • edited Loading

archaephyrryx commented Sep 7, 2020

vdukhovni commented Aug 29, 2020 •

edited

Loading

fosskers commented Sep 4, 2020 •

edited

Loading

fosskers commented Sep 6, 2020 •

edited

Loading