Faster `c_count`? Perhaps a GHC rather than `bytestring` issue? #274

vdukhovni · 2020-08-27T23:27:59Z

Recent performance issues in streaming-bytestring's lineSplit and bytestring's findIndices indirectly lead to observations about the relative performance of the C memchr(3) vs. native byte-by-byte comparison loops of the kind found in bytestring's c_count function.

In particular, it is at first surprising that in a 32K block with not unreasonably short lines, counting the newlines via repeated calls to memchr() turns out to noticeably outperform a single call to fps_count(). The reason is of course that GCC's or clang's memchr() is not just a naïve byte-by-byte loop. It may use vector instructions of the CPU or clever portable tricks to efficiently check wether a 64-bit-aligned 64-bit block contains any instances of a given byte.

Some of these techniques, can be found under an MIT license in repos with code extracted from Rust, such as:

https://github.com/Freaky/fast-bytecount

so could perhaps be integrated (with proper attribution, and perhaps after checking with the authors) into bytestring or GHC. Which brings me to the real question:

Should fast memchr() and counting occurrences of a given byte, ... belong in bytestring, or are they perhaps properly better implemented as primops in GHC itself, and not require FFI calls? GHC might then support these for ByteArray# and/or Ptr Word8?

On some level I feel that CPU-optimised primitives of this sort belong in the compiler, and that libraries just need to use the provided primitives instead of rolling their own. There's some precedent for this with some of the SIMD vector primitives and the bit-level "popcount" primitives found in GHC.Exts.

On the other hand if GHC for some reason is not the right vehicle for per-CPU-optimised primitives of this sort, then perhaps bytestring might at least include the clever portable C optimised versions, but I don't think that bytestring can reasonably support assembly optimisations targeted at given CPUs or LLVM code generation, ...

So this issue is fairly open-ended. Is this worth thinking about? And if so is this a GHC topic or a bytestring topic?

Cc: @bgamari

The text was updated successfully, but these errors were encountered:

sjakobi · 2020-08-28T07:02:20Z

Somewhat related: #202, #109.

(CC @0xd34df00d)

vdukhovni · 2020-08-28T08:59:24Z

Somewhat related: #202, #109.

(CC @0xd34df00d)

Thanks! More than somewhat related. :-) Basically, the same questions really, but what's not asked in those issues as emphatically is whether this really belongs in bytestring or not. My instinct is these primitives are properly the domain of the compiler, or if not primops then at least some GHC.Ext functions in base. Maintaining CPU-architecture-optimised code in bytestring does not feel like the right place (or likely maintainer skillset).

Bodigrim · 2021-05-15T20:16:37Z

Since #202 has landed into master, I think the remaining part belongs to GHC topics and is currently out of scope for bytestring. FWIW I'd be happy to see counting primitives provided by GHC itself.

sjakobi added the performance label Aug 28, 2020

vdukhovni mentioned this issue Sep 17, 2020

Make safe/unsafe foreign calls size dependent #282

Open

Bodigrim linked a pull request May 15, 2021 that will close this issue

Implement count with SSE 4.2 and AVX2 #202

Merged

Bodigrim closed this as completed May 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster `c_count`? Perhaps a GHC rather than `bytestring` issue? #274

Faster `c_count`? Perhaps a GHC rather than `bytestring` issue? #274

vdukhovni commented Aug 27, 2020

sjakobi commented Aug 28, 2020

vdukhovni commented Aug 28, 2020

Bodigrim commented May 15, 2021

Faster c_count? Perhaps a GHC rather than bytestring issue? #274

Faster c_count? Perhaps a GHC rather than bytestring issue? #274

Comments

vdukhovni commented Aug 27, 2020

sjakobi commented Aug 28, 2020

vdukhovni commented Aug 28, 2020

Bodigrim commented May 15, 2021

Faster `c_count`? Perhaps a GHC rather than `bytestring` issue? #274

Faster `c_count`? Perhaps a GHC rather than `bytestring` issue? #274