Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster c_count? Perhaps a GHC rather than bytestring issue? #274

Closed
vdukhovni opened this issue Aug 27, 2020 · 3 comments · Fixed by #202
Closed

Faster c_count? Perhaps a GHC rather than bytestring issue? #274

vdukhovni opened this issue Aug 27, 2020 · 3 comments · Fixed by #202

Comments

@vdukhovni
Copy link
Contributor

Recent performance issues in streaming-bytestring's lineSplit and bytestring's findIndices indirectly lead to observations about the relative performance of the C memchr(3) vs. native byte-by-byte comparison loops of the kind found in bytestring's c_count function.

In particular, it is at first surprising that in a 32K block with not unreasonably short lines, counting the newlines via repeated calls to memchr() turns out to noticeably outperform a single call to fps_count(). The reason is of course that GCC's or clang's memchr() is not just a naïve byte-by-byte loop. It may use vector instructions of the CPU or clever portable tricks to efficiently check wether a 64-bit-aligned 64-bit block contains any instances of a given byte.

Some of these techniques, can be found under an MIT license in repos with code extracted from Rust, such as:

https://github.com/Freaky/fast-bytecount

so could perhaps be integrated (with proper attribution, and perhaps after checking with the authors) into bytestring or GHC. Which brings me to the real question:

  • Should fast memchr() and counting occurrences of a given byte, ... belong in bytestring, or are they perhaps properly better implemented as primops in GHC itself, and not require FFI calls? GHC might then support these for ByteArray# and/or Ptr Word8?

On some level I feel that CPU-optimised primitives of this sort belong in the compiler, and that libraries just need to use the provided primitives instead of rolling their own. There's some precedent for this with some of the SIMD vector primitives and the bit-level "popcount" primitives found in GHC.Exts.

On the other hand if GHC for some reason is not the right vehicle for per-CPU-optimised primitives of this sort, then perhaps bytestring might at least include the clever portable C optimised versions, but I don't think that bytestring can reasonably support assembly optimisations targeted at given CPUs or LLVM code generation, ...

So this issue is fairly open-ended. Is this worth thinking about? And if so is this a GHC topic or a bytestring topic?

Cc: @bgamari

@sjakobi
Copy link
Member

sjakobi commented Aug 28, 2020

Somewhat related: #202, #109.

(CC @0xd34df00d)

@vdukhovni
Copy link
Contributor Author

Somewhat related: #202, #109.

(CC @0xd34df00d)

Thanks! More than somewhat related. :-) Basically, the same questions really, but what's not asked in those issues as emphatically is whether this really belongs in bytestring or not. My instinct is these primitives are properly the domain of the compiler, or if not primops then at least some GHC.Ext functions in base. Maintaining CPU-architecture-optimised code in bytestring does not feel like the right place (or likely maintainer skillset).

@Bodigrim
Copy link
Contributor

Since #202 has landed into master, I think the remaining part belongs to GHC topics and is currently out of scope for bytestring. FWIW I'd be happy to see counting primitives provided by GHC itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants