replaces findIndices with elemIndices #15
`findIndices` is vastly inefficient as currently implemented in the bytestring library; `elemIndices` is much faster for finding only newlines.
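For illustration, the two searches can be compared on a strict chunk. This is a minimal sketch, assuming only the `bytestring` package; the names `newlineOffsets` and `newlineOffsetsSlow` are hypothetical and do not come from this PR.

```haskell
import qualified Data.ByteString.Char8 as B8

-- Offsets of every '\n' in a strict chunk, using the search that is
-- specialised to a single element (the function this PR switches to).
newlineOffsets :: B8.ByteString -> [Int]
newlineOffsets = B8.elemIndices '\n'

-- The same offsets via the general predicate search, kept only for
-- comparison (the function this PR moves away from).
newlineOffsetsSlow :: B8.ByteString -> [Int]
newlineOffsetsSlow = B8.findIndices (== '\n')
```

Both functions return the same list for any input; only the cost of producing it should differ.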
Modifies the internal logic of the Data.ByteString.Streaming.Char8.lineSplit loops to call Data.ByteString.count only once for any given chunk, and to repeatedly decrement its value by the number of lines consumed per iteration over the same chunk. This change should improve the overall performance of `lineSplit n` when n is small, by avoiding the quadratic O(m^2/n) cost of calling `B.count` on every pass over each individual chunk.
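The count-once idea can be sketched on a single strict chunk. This is a simplified illustration, not the library's actual `lineSplit` loop: the names `splitChunk` and `splitAfterNewlines` are hypothetical, and the real code works on a streaming bytestring and carries partial groups across chunk boundaries.

```haskell
import qualified Data.ByteString.Char8 as B8

-- Split a strict chunk into groups of n lines each (terminators kept).
-- The newline count is computed once up front and then decremented by n
-- per emitted group, instead of being recomputed on every pass.
splitChunk :: Int -> B8.ByteString -> [B8.ByteString]
splitChunk n0 chunk = go (B8.count '\n' chunk) chunk
  where
    n = max 1 n0                      -- treat non-positive sizes as 1
    go remaining bs
      | B8.null bs    = []
      | remaining < n = [bs]          -- fewer than n lines left: emit the tail
      | otherwise     =
          let (grp, rest) = splitAfterNewlines n bs
          in  grp : go (remaining - n) rest

-- Split just after the k-th newline; if there are fewer than k newlines,
-- the whole input is returned as the first component.
splitAfterNewlines :: Int -> B8.ByteString -> (B8.ByteString, B8.ByteString)
splitAfterNewlines k bs =
  case drop (k - 1) (B8.elemIndices '\n' bs) of
    (i : _) -> B8.splitAt (i + 1) bs
    []      -> (bs, B8.empty)
```

Concatenating the result of `splitChunk n chunk` gives back `chunk`, so the sketch only regroups bytes and never drops any.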
TL;DR: THANKS! This PR is substantial progress: it never does worse, and often does much better, than the head master commit. I think that perhaps the next step should be taken, and the
I made some measurements of the resulting performance on the head commit of the master branch, this PR (#15), and a version that (as proposed) entirely does away with calling
Time to read and output to /dev/null 10 million lines of ~126 bytes each, split
Time to read and output to /dev/null 10 million empty lines, split into chunks
No major surprises: the PR #15 code and the original code do better on a worst-case stream of just empty lines, but only when the requested line cluster size is as big as or bigger than the streaming bytestring chunk size, in which case they're able to take advantage of the call to
My instinct is that twice the performance on typical files with longer lines compares favourably against 1.6x the performance on a stream of empty lines, seen only when the requested line grouping is ~32k lines at a time or more!

For the record, here's the benchmark code:

    module Main (main) where

    import qualified Streaming as S
    import qualified Data.ByteString.Streaming.Char8 as Q
    import qualified Data.List as L
    import System.Environment (getArgs)

    -- Group size used when no command-line argument is given.
    defaultLineCount :: Int
    defaultLineCount = 1

    main :: IO ()
    main = do
        -- The first argument, if present, is the number of lines per group.
        n <- maybe defaultLineCount (read . fst) <$> L.uncons <$> getArgs
        -- Copy stdin to stdout, one n-line group at a time.
        S.mapsM_ Q.putStr
            $ Q.lineSplit n
            $ Q.stdin

At this point feedback from the maintainers, et al., would be great. Cc: @Bodigrim, @cartazio, @sjakobi, @andrewthad
Anyone?
I'm not involved in the maintenance of this library anymore. That aside, I think that this seems like a good improvement. The test suite has a few tests for splitting on newlines, so if it's still passing, the new implementation is probably correct.
Who are the current (active?) maintainers? It would be nice to see this resolved. The key question is whether to go with this PR as-is, or to take a further step in the direction of eliminating the line count entirely, yielding broadly another ~2x performance improvement, at the cost of some loss of performance on vast streams of just newlines consumed with split counts larger than the number of lines (i.e. bytes) per chunk. My take is that the "No Count" variant is the logical next step.
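A rough sketch of what a "No Count" approach might look like on a single strict chunk, for comparison with the count-based one. This is an assumption-laden illustration rather than the code proposed in this thread: `splitNoCount` is a hypothetical name, and the real change applies to the streaming `lineSplit` loops.

```haskell
import qualified Data.ByteString.Char8 as B8

-- Split a strict chunk into groups of n lines each (terminators kept),
-- without ever counting the newlines in the chunk: each step simply
-- walks forward to the n-th newline and cuts there.
splitNoCount :: Int -> B8.ByteString -> [B8.ByteString]
splitNoCount n0 = go
  where
    n = max 1 n0                      -- treat non-positive sizes as 1
    go bs
      | B8.null bs = []
      | otherwise  =
          case drop (n - 1) (B8.elemIndices '\n' bs) of
            (i : _) -> let (grp, rest) = B8.splitAt (i + 1) bs
                       in  grp : go rest
            []      -> [bs]           -- fewer than n newlines left: emit the tail
```

The trade-off described above shows up here: with no count available, there is no cheap way to tell in advance that a whole chunk holds fewer than n lines, so every newline is visited individually.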
Hi @vdukhovni, sorry that it took a while to respond. I like where this PR is going, and think the
It can make a Hackage release as soon as all that is ready.
Done in #18.
I think the tests, however, could use some upkeep. They're using rather dated
Oh, it looks like I haven't given this repo the usual multi-LTS GitHub CI setup I use. I'll do that.
As #18 has already been merged, and this PR is a variant solution to the same problem addressed in that PR, I am closing both this PR and the associated issue (assuming the latter has not already been resolved).
findIndices vastly inefficient as currently implemented in bytestring
library
elemIndices is much faster for finding only newlines
should resolve #14