
Implement count with SSE 4.2 and AVX2 #202

Merged · 10 commits · May 15, 2021
Conversation

@0xd34df00d (Contributor)

This provides a fairly significant performance boost: counting lines via BS.count 10 was taking about 1.1 s with an mmaped 1.8 gigabyte file, and now it takes 0.35 s. For comparison, length . BS.lines takes about 0.4 s (IMO count should definitely be at least as fast), and Unix wc takes about 0.3-0.4 s.

@sjakobi (Member) commented Feb 3, 2020

Interesting!

I'm wondering why count didn't use memchr already while other functions do:

bytestring/Data/ByteString.hs

Lines 1001 to 1013 in 95fe6bd

split :: Word8 -> ByteString -> [ByteString]
split _ (PS _ _ 0) = []
split w (PS x s l) = loop 0
    where
      loop !n =
        let q = accursedUnutterablePerformIO $ withForeignPtr x $ \p ->
                  memchr (p `plusPtr` (s+n)) w (fromIntegral (l-n))
        in if q == nullPtr
             then [PS x (s+n) (l-n)]
             else let i = accursedUnutterablePerformIO $ withForeignPtr x $ \p ->
                            return (q `minusPtr` (p `plusPtr` s))
                  in PS x (s+n) (i-n) : loop (i+1)

bytestring/Data/ByteString.hs

Lines 1085 to 1089 in 95fe6bd

elemIndex :: Word8 -> ByteString -> Maybe Int
elemIndex c (PS x s l) = accursedUnutterablePerformIO $ withForeignPtr x $ \p -> do
let p' = p `plusPtr` s
q <- memchr p' c (fromIntegral l)
return $! if q == nullPtr then Nothing else Just $! q `minusPtr` p'

c29225b mentions using memchr for count but goes with a different implementation for some reason?!

It would be nice if you could share your benchmark but I guess that's somewhat blocked on #196.

#109 also looks related.

@0xd34df00d (Contributor, Author)

My benchmark is something really silly, along the lines of

import qualified Data.ByteString as BS
import System.IO.Posix.MMap

main :: IO ()
main = do
  contents <- unsafeMMapFile "/huge/file/on/tmpfs.txt"
  print $ BS.count 10 contents

The file size is about 2 gigs, and it's on a tmpfs partition with no swapping, so no IO involved. Since the file is mmaped, there's also no GC pressure at all.

Thanks for the pointer to #109 , I missed it! While I'm at this task I can just as well try to brush up my SIMD intrinsics skills and also implement a version using PCMPESTRM as suggested in one of the comments — it might indeed be beneficial, especially for strings where the character in question occurs often enough.

In case I get decent results, what's bytestring's policy on SIMD, supported architectures and adding flags for those, or C compiler dependency? IIRC given code along the lines of (I don't remember the right attribute syntax off the top of my head though)

__attribute__((target("sse42")))
void foo() { ... }

__attribute__((target("avx")))
void foo() { ... }

/* fallback */
void foo() { ... }

gcc can generate some dispatching code to select the right implementation at runtime according to the current (runtime) CPU architecture.

@0xd34df00d (Contributor, Author) commented Feb 6, 2020

Alright, did some experimenting with PCMPESTRM, and the results are quite in favour of it, especially given that its performance doesn't depend on the character frequency.

  1. When looking for '\n' (with a frequency of about 0.8%), the memchr-based run time is about 311 ms. With PCMPESTRM it's a bit (but statistically significantly) lower — about 295 ms.
  2. When looking for ' ', which occurs exactly twice as often, the memchr runtime degrades to 425 ms, while PCMPESTRM stays at 295 ms.
  3. When looking for ',', with a frequency of about 10% (my test file is a CSV), memchr is dying at about 1150 ms, while PCMPESTRM — you guessed it — stays at 295 ms.

So I guess if every other character were the character in question, a memchr-based solution would be a no-go.

My gut feeling is that with PCMPESTRM the performance is limited by CPI, so I'll play around some more to see whether using 256-bit-wide registers and instructions helps. That means I'm not pushing what I currently have just yet, but for reference, it's:

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>  /* SSE4.2 string intrinsics and _popcnt64 */

__attribute__((target("sse4.2")))
unsigned long fps_count(unsigned char *str, unsigned long len, unsigned char w) {
    __m128i pat = _mm_set1_epi8(w);

    const int mode = _SIDD_SBYTE_OPS | _SIDD_CMP_EQUAL_EACH;

    unsigned long res = 0;

    size_t i = 0;

    /* Scalar prologue: advance until str + i is 64-byte aligned. */
    for (; i < len && (intptr_t)(str + i) % 64; ++i) {
        res += str[i] == w;
    }

    /* Main loop: four aligned 16-byte loads per iteration. The bound is
       i + 64 <= len rather than i < len - 64, which would underflow for
       len < 64 since len is unsigned. */
    for (; i + 64 <= len; i += 64) {
        __m128i p1 = _mm_load_si128((const __m128i*)(str + i + 16 * 0));
        __m128i p2 = _mm_load_si128((const __m128i*)(str + i + 16 * 1));
        __m128i p3 = _mm_load_si128((const __m128i*)(str + i + 16 * 2));
        __m128i p4 = _mm_load_si128((const __m128i*)(str + i + 16 * 3));
        __m128i r1 = _mm_cmpestrm(p1, 16, pat, 16, mode);
        __m128i r2 = _mm_cmpestrm(p2, 16, pat, 16, mode);
        __m128i r3 = _mm_cmpestrm(p3, 16, pat, 16, mode);
        __m128i r4 = _mm_cmpestrm(p4, 16, pat, 16, mode);
        res += _popcnt64(_mm_extract_epi64(r1, 0));
        res += _popcnt64(_mm_extract_epi64(r2, 0));
        res += _popcnt64(_mm_extract_epi64(r3, 0));
        res += _popcnt64(_mm_extract_epi64(r4, 0));
    }

    /* Scalar epilogue for the remaining tail. */
    for (; i < len; ++i) {
        res += str[i] == w;
    }

    return res;
}

Let's probably start discussing how SIMD extensions-dependent things should be integrated?

@sjakobi (Member) commented Feb 6, 2020

Wow, those results with PCMPESTRM look very promising!

Let's probably start discussing how SIMD extensions-dependent things should be integrated?

Indeed – I can't be of much help there though! :)

Ping @hvr, @dcoutts!

@0xd34df00d You can also try garnering more feedback on the libraries mailing list.

@0xd34df00d changed the title from "Use memchr in count" to "Implement count with SSE 4.2 and AVX2" on Feb 13, 2020
@0xd34df00d (Contributor, Author)

@sjakobi I spent some more time on this, and AVX2 looks even better (see the table at https://github.com/0xd34df00d/counting-chars)!

I made some effort to make sure this still works on machines that don't have AVX2 (or even SSE4.2), but I'm not sure about the treatment of compilers other than gcc and clang, so ideally another pair of eyes better suited to this sort of thing would take a look.

@Bodigrim (Contributor)

@0xd34df00d this is a very inspiring result! Could you please raise a question about SIMD policies on Libraries maillist?

@cartazio

THIS IS REALLY COOL, DEADBEEF, though of course I need to read it much more closely :)

@cartazio

For what it's worth, there are no SIMD optimization guidelines for the cbits parts of Haskell libraries, and the main bottleneck/engineering challenge is having the right dynamic code selection piece, which this PR seems to do!

One important question is always: what's the overhead of the dynamic dispatch on smaller inputs? (E.g., is it better to only do the SIMD when inputs are above some size threshold?) Plus questions about which CPU generations actually benefit (which is tricky!).

return fps_count_naive(str, len, w);
#else
if (len <= 1024) {
return fps_count_naive(str, len, w);


oh sweet, it does do the fallback for small inputs :)

@cartazio

Your dynamic dispatch looks correct in isolation, but if I'm reading the gcc/clang docs about function multi-versioning correctly, I think you're either using it wrong or underusing it. Or I need to read/play with those more and understand the tricks myself.

reference urls:
https://gcc.gnu.org/wiki/FunctionMultiVersioning
https://llvm.org/devmtg/2014-10/Slides/Christopher-Function%20Multiversioning%20Talk.pdf
https://hannes.hauswedell.net/post/2017/12/09/fmv/
https://clang.llvm.org/docs/AttributeReference.html#cpu-dispatch

https://gcc.gnu.org/onlinedocs/gcc-10.1.0/gcc/Function-Multiversioning.html#Function-Multiversioning

To be clear: my fuzzy understanding is that if the file were compiled in C++ mode, then the different attributes with the same function name would do an auto-select? (Here you're doing a dynamic dispatch via C either way, so the variants deliberately have different names.) Also, clang apparently has different syntax for the attribute stuff.

To be clear, I literally learned this was a feature today :), so that's great!

@0xd34df00d (Contributor, Author)

Yay, thank you @cartazio for taking a look at this!

Regarding the dynamic dispatch — yes, gcc surely can and will generate the dispatching code from these attributes. clang, on the other hand, doesn't seem to always do this — at least, in my experience I had to manually write similar dispatching code about 3 years ago. Maybe things have got better since then!

Of course, I'm all for dropping manual detection and dispatch (the less code, the better), but I'm not entirely comfortable doing so given the experience above.

Also, speaking of the overhead of dynamic dispatch — in any case it only happens once per OS/C thread (or whatever the right Haskell lingo for this is), and after that it's a matter of a single comparison of a thread-local function pointer against 0, which a CPU's branch predictor will hopefully learn very quickly. My primary reason for using the naive function for smaller inputs is to avoid all the overhead of the unaligned prologue/epilogue handling in the SIMD versions.

@cartazio

No, no, it looks fine; I'm just not often treated to the experience of seeing well-written, portable SIMD dispatching.

I think you don't need to do the attribute annotation if you're doing the manual dispatch? OTOH, I literally have zero experience with this stuff, so I'm not sure :).

I think the naive fallback for smaller inputs is the right choice though, so that seems fine to me.

@cartazio

Let's make sure the build goes green and this code works with both recent gcc and clang! :)

@sjakobi (Member) commented May 22, 2020

A good compatibility test would be to build this on GHC's CI infrastructure. I've meant to properly document this for a while, but essentially you

  1. Fork GHC on https://gitlab.haskell.org/ghc/ghc
  2. Update the libraries/bytestring submodule to point at your fork of bytestring
  3. Start a CI pipeline for your GHC branch at https://gitlab.haskell.org/username/ghc/pipelines

@cartazio

@0xd34df00d some build errors you should fix ;)


cbits/fpstring.c: In function ‘select_fps_simd_impl’:
cbits/fpstring.c:214:9: warning: implicit declaration of function ‘__get_cpuid_count’ [-Wimplicit-function-declaration]
  214 |     if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
      |         ^

@cartazio

I think the build is erroring there, but I'm not sure :)

@0xd34df00d (Contributor, Author)

@cartazio,

I think you don't need to do the attribute annotation if you're doing the manual dispatch?

That depends on compiler flags like -march (I think gcc and clang behave exactly the same here). If gcc/clang generate code for, say, a generic x86_64 machine then, long story short, they don't know they are allowed to use SSE 4.2 or AVX, and they will error out.

Re the build error, that's interesting, I'll take a look! It might be a different standard library with different intrinsics exposed. Are you building on a Mac by any chance?

@sjakobi, thanks for the instructions, will do!

@Bodigrim (Contributor)

@0xd34df00d Travis build complains about undefined __get_cpuid_count.

@Bodigrim added the "blocked: patch-needed" (somebody needs to write a patch or contribute code) and "performance" labels on Aug 21, 2020
@Bodigrim (Contributor) commented Jan 6, 2021

@0xd34df00d are you still interested in this PR?

@Bodigrim (Contributor) commented Jan 9, 2021

  • __get_cpuid_count requires gcc-6 and must be guarded by #if __GNUC__ > 5.
  • The patch does not compile on MacOS and FreeBSD with argument to '__builtin_ia32_pcmpestrm128' must be a constant integer failures. That's because clang is stricter than gcc.
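For illustration, a guard along those lines might look like the sketch below. The CPUID leaf/bit used (EAX=7, ECX=0, AVX2 in EBX bit 5) comes from the x86 CPUID specification; everything else, including the function name, is an assumption about how the surrounding code could be organized:

```c
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>  /* GCC/clang x86-only header */
#endif

/* Returns 1 if the CPU reports AVX2 support, 0 otherwise (including
 * when the compiler or platform cannot perform the query at all). */
static int have_avx2(void) {
#if defined(__GNUC__) && __GNUC__ > 5 && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid_count was added in gcc-6, hence the __GNUC__ > 5 guard.
     * Leaf 7, subleaf 0 reports AVX2 in EBX bit 5. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return (ebx >> 5) & 1;
#endif
    return 0;
}
```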

These are probably solvable issues, but this amount of conditional compilation makes me uncomfortable about proceeding. We are not in a position to test the bytestring build against all possible combinations of C compilers and platforms.

@vdukhovni what's your take?

@0xd34df00d (Contributor, Author)

@Bodigrim sorry for the slow reply; this completely fell out of my attention span. I'm definitely still interested!

Frankly, neither am I comfortable with the number of platforms I cannot test this on (especially for such an important and foundational library). Short-term, it seems the best way is to have a cabal flag controlling whether the hand-written SIMD implementation is used for this function (and possibly others), disabled by default.

What I'd like to see as an ideal long-term solution is this function implemented purely in terms of GHC's SIMD primops. In fact, I tried that, but some things are missing in the set of available primops (namely, there's an undefined there in place of something, but I did that a while ago so I don't remember what exactly is missing off the top of my head).

@Bodigrim (Contributor) commented Jan 9, 2021

Short-term, it seems the best way is to have a cabal flag controlling whether the hand-written SIMD implementation is used for this function (and possibly others), disabled by default.

Besides usability concerns (switching a flag of bytestring will cause everything to rebuild, including half of the boot libraries), I'm unsure about the intention. Users care less about performance on a local machine than in production, but that's exactly where they have much less control over the environment.

Supporting an additional cabal flag is very costly from maintenance perspective. More configurations to test, more potential build failures, more support.

I'm not opposed to SIMD in bytestring in general. Could your patch be rewritten in a more platform-independent way, possibly sacrificing some performance gains? CC @ethercrow who worked on #310.

@0xd34df00d (Contributor, Author)

OK, implemented this with C11's call_once. Although a bad libc implementation might leave this function fairly unoptimized, it's only used for big enough strings, where the actual counting time far exceeds the cost of call_once. And although atomics would almost certainly be faster, I'd much rather use a library function designed specifically for this task, to avoid placing the burden of reasoning about atomics and the right memory orderings on myself and the next reader of the code.

@vdukhovni (Contributor)

Well, C11 call_once may not be terribly portable. For example, on FreeBSD 12.2 it requires linking with -lstdthreads. On Fedora 31 it requires linking with -lpthread. Either way, it is not found in just libc.

On MacOS "Big Sur", there is no <threads.h> header file.

@0xd34df00d (Contributor, Author)

I guess this means POSIX's pthread_once is also not an option. Welp, atomics it is then!

@vdukhovni (Contributor)

Well, pthread_once is an option on Unix-like systems, assuming explicit linking with -lpthread is not an obstacle. It is at least more broadly available, but then perhaps you run into an issue with Windows...

My opinion remains that these sorts of primitives belong in GHC itself as new primops, and not so much in bytestring, but it seems mine is a minority view on this...

@Bodigrim (Contributor) commented May 2, 2021

Well, I've heard talk of vectorized operations in GHC for years without much fruition. Even if a champion suddenly emerges to push this stuff forward, it would still take a long time before bytestring could use it. If we can speed things up without a (potentially infinite) delay, it's right to do so.

My personal opinion is that vectorized operations will not find their way into GHC primitives until a critical mass of libraries employing them accumulates. Then it would be easy to make a case in their favor.

@0xd34df00d (Contributor, Author)

@vdukhovni I skimmed superficially through what GHC has on SIMD primops some time ago, and, while I more than agree with your sentiment for multiple reasons, I see several obstacles (in fact, I tried writing the above with primops, and stumbled upon a few missing ones).

Firstly, a more technical one: if I understand correctly, GHC currently only allows SIMD primops from GHC.Prim when building with the LLVM codegen, which, in this case, means that -fllvm needs to be turned on when building bytestring. Of course, I'm sure that the NCG can also be taught about the primops, so that's more of a technicality (that still needs to be accounted for, though).

Secondly, a more conceptual one: I'm not sure how to expose all the interesting instructions in an architecture-agnostic way. The usual things like vector addition are easy — they're present in pretty much any SIMD extension set with the same semantics. But if we are starting to consider things like cmpestrm, that becomes entirely non-obvious — I bet ARM doesn't have a direct substitute for this instruction.

One way to solve the latter is by allowing the user to define their own primops, that might look like

cmpestrm# :: Word8x16# -> Word8x16# -> Imm8# -> Word8x16#
cmpestrm# s1 s2 imm8 = [ghcMagic| pcmpestrm s1 s2 imm8 |]

or something along those lines (where Imm8# stands for some immediate value known at compile-time).

Although I'm not sure how deep a can of worms it opens: in fact, the above definition should also accept lengths to be stored in eax/rax and edx/rdx (that's the contract of pcmpestrm), so it's really a compound instruction, and the codegen needs to know those registers get clobbered.

@Bodigrim (Contributor) commented May 8, 2021

Benchmarks look impressive, 70-85% faster for long strings!
If there are no further suggestions (ping @hsyl20), I'll merge it in a couple of days.

@vdukhovni (Contributor)

Secondly, a more conceptual one: I'm not sure how to expose all the interesting instructions in an architecture-agnostic way. The usual things like vector addition are easy — they're present in pretty much any SIMD extension set with the same semantics. But if we are starting to consider things like cmpestrm, that becomes entirely non-obvious — I bet ARM doesn't have a direct substitute for this instruction.

What I had in mind is not so much primops for non-portable low-level instructions, but in fact direct support for higher-level primitives (like memchr(), ...) in GHC via appropriate low-level primitives when available, or naive loops otherwise. These should work not only for ByteString but also for ByteArray...

@cartazio commented May 8, 2021 via email

@Bodigrim merged commit 44edcaa into haskell:master on May 15, 2021
@Bodigrim (Contributor)

Thanks, @0xd34df00d!

@Bodigrim added this to the 0.11.2.0 milestone on May 16, 2021
Bodigrim pushed a commit to Bodigrim/bytestring that referenced this pull request May 16, 2021
* Implement SSE4.2 and AVX2-based variants of the `count`

* Address PR comments wrt SIMD-based character counting

* Add test for `count` on longer strings

* Add benchmarks for strict bytestring's count method

* Fix tabs

* Use C11's call_once instead of TLS for choosing counting SIMD function

* Build cbits with -std=c11

* Add a point to `count` benchmarks slightly above the SIMD threshold

* Initialize the `count` SIMD ptr with atomics

* Enable `count` SIMD support only if atomics are available

Turns out a conforming C implementation might be running on a platform
with threads, but with no atomics in the C stdlib.
@chrisdone (Member)

I wonder about the performance of count combined with par, with the 1GB file split in half and processed in parallel. 350ms is a long time for my other cores to be doing nothing. 🤔

@cartazio commented May 17, 2021 via email

@vdukhovni (Contributor)

So if bytestring length is greater than say 100,000 or 10,000 or something we should call a safe FFI version?

Wouldn't that conflict with the push to use unpinned memory?

@cartazio commented May 17, 2021 via email

@vdukhovni (Contributor)

My impression was that "unsafe" calls are not preempted by the GC, and don't necessarily need pinned memory, while "safe" calls do. Since there are proposals in flight to move ByteString to unpinned ByteArray, I'd expect those to be less amenable to use with "safe" calls.

@hsyl20 (Contributor) commented May 17, 2021

If there are no further suggestions (ping @hsyl20), I'll merge it in a couple of days.

Sorry for not being very active these days (our first baby at home is taking quite some time). I don't have much to add anyway. Good work! I also wish we had a better story for SIMD primops in GHC, but we don't even have a good story for sub-word types (Word8#, Word16#, Word32#, ...) yet, so it will take some time.

@cartazio

I think bytestring won't switch to unconditionally unpinned memory, but rather have pinned and unpinned flavors. (It's a deeply crash-prone breaking change to silently make them unpinned. They may as well be new modules or a new package at that point :) )

noughtmare pushed a commit to noughtmare/bytestring that referenced this pull request Dec 12, 2021
Successfully merging this pull request may close these issues:

  • Faster c_count? Perhaps a GHC rather than bytestring issue?
  • Much faster count function in plain Haskell