Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replaces findIndices with elemIndices #15

Closed
wants to merge 2 commits into from

Conversation

archaephyrryx
Copy link

findIndices vastly inefficient as currently implemented in bytestring
library

elemIndices is much faster for finding only newlines

should resolve #14

findIndices vastly inefficient as currently implemented in bytestring
library

elemIndices is much faster for finding only newlines
Data/ByteString/Streaming/Char8.hs Outdated Show resolved Hide resolved
Modifies internal logic of Data.ByteString.Streaming.Char8.lineSplit
loops to call Data.ByteString.count only once for any given chunk,
and repeated decrement its value by the number of lines consumed
per iteration over the same chunk.

This change should improve the overall performance of `lineSplit n`
when n is small, by avoiding quadratic O(m^2/n) performance resulting
from calling `B.count` every pass over each individual chunk.
@vdukhovni
Copy link
Contributor

vdukhovni commented Aug 29, 2020

TL;DR: THANKS! This PR is substantial progress, and never does worse and often does much better than the head master commit. I think that perhaps the next step should be taken, and the B.count code removed entirely, at least until it can be improved and shown to actually deliver speedups in realistic use cases.

I made some measurements of the resulting performance on the head commit of the master branch, this PR (15), and a version that (as proposed) entirely does away with calling B.count (the "No Count" columns below). The measurements were made with lineSplit counts of 1, 256, 4096, 32768 and 524288 (16 x 32768) lines, and for a "normal" file with ~10m lines averaging 126 bytes and a worst-case file consisting of just empty lines. The results are below:

Time to read and output to /dev/null 10million lines of ~126 byte lines, split
into chunks of 1, 256, 4096, 32768 and 524288 lines via lineSplit.

SplitCount Release PR#15 No Count
1 131.331 4.450 3.630
256 9.480 1.651 0.771
4096 1.941 1.421 0.711
32768 1.481 1.390 0.700
524288 1.421 1.400 0.690

Time to read and output to /dev/null 10million empty lines, split into chunks
of 1, 256, 4096, 32768 and 524288 lines (i.e. bytes) via lineSplit.

SplitCount Release PR#15 No Count
1 111.140 1.661 1.580
256 0.770 0.300 0.170
4096 0.320 0.280 0.160
32768 0.120 0.110 0.160
524288 0.100 0.101 0.160

No major surprises, the PR#15 code and original code do better on a worst-case stream of just empty lines, but only when the requested line cluster size is as big or bigger than the streaming bytestring chunk size, in which case they're able to take advantage of the call to B.count to save some work.

My instinct is that twice the performance on typical files with longer lines compares favourably against 1.6x the performance on a stream of empty lines seen only when the requested line grouping ~32k lines at a time or more!

For the record, here's the benchmark code:

module Main (main) where

import qualified Streaming as S
import qualified Data.ByteString.Streaming.Char8 as Q
import qualified Data.List as L
import System.Environment (getArgs)

defaultLineCount :: Int
defaultLineCount = 1

main :: IO ()
main = do
    n <- maybe defaultLineCount (read . fst) <$> L.uncons <$> getArgs
    S.mapsM_ Q.putStr
        $ Q.lineSplit n
        $ Q.stdin

At this point feedback from the maintainers, et. al., would be great.

Cc: @Bodigrim, @cartazio, @sjakobi, @andrewthad

@vdukhovni
Copy link
Contributor

Anyone?

@andrewthad
Copy link
Contributor

I'm not involved in the maintenance of this library anymore. That aside, I think that this seems like a good improvement. The test suite has a few tests for splitting on newlines, so if it's still passing, the new implementation is probably correct.

@vdukhovni
Copy link
Contributor

I'm not involved in the maintenance of this library anymore. That aside, I think that this seems like a good improvement. The test suite has a few tests for splitting on newlines, so if it's still passing, the new implementation is probably correct.

Who are the current (active?) maintainers? It would be nice to see this resolved. The key question is whether to go with this PR as-is, or to take a further step in the direction of eliminating the line count entirely, yielding broadly another ~2x performance improvement, at the cost of some loss of performance on vast streams of just newlines consumed with split counts larger than the number of lines (i.e. bytes) per chunk.

My take is that the "No Count" variant is the logical next step.

@vdukhovni
Copy link
Contributor

Cc: @chessai, @fosskers any feedback on this PR? It seems the package has been stable and not in need of any maintenance for some time, which is often a feature (stable, well tested code, ...), but it would still be nice to get some dusty corners attended to...

@fosskers
Copy link
Collaborator

fosskers commented Sep 4, 2020

Hi @vdukhovni , sorry that it took a while to respond. I like where this PR is going, and think the No Count variant is the direction we should move in. As @andrewthad said, if the tests are passing and your benches show good numbers, then it should be an easy merge.

It can make a Hackage release as soon as all that is ready.

@vdukhovni
Copy link
Contributor

Hi @vdukhovni , sorry that it took a while to respond. I like where this PR is going, and think the No Count variant is the direction we should move in. As @andrewthad said, if the tests are passing and your benches show good numbers, then it should be an easy merge.

It can make a Hackage release as soon as all that is ready.

Done in #18

I think the tests however could use some upkeep. They're using rather dated stack snapshots, and I think cabal would be a better choice, and would ideally test GHC 8.0, 8.2, 8.4, 8.6, 8.8 and 8.10 and soon also 9.0, as well as any 7.x versions still supported, ...
But otherwise, and perhaps with less effort a more comprehensive set of stack LTS snaphots and a more recent nightly.

@fosskers
Copy link
Collaborator

fosskers commented Sep 6, 2020

Oh it looks like I haven't given this repo the usual multi-LTS Github CI setup I use. I'll do that.

@archaephyrryx
Copy link
Author

As #18 has already been merged, and this PR is a variant solution to the same problem addressed in that PR, I am closing both this PR and the associated issue (assuming the latter has not already been resolved).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lineSplit inefficiency using ByteString findIndices
4 participants