Implement `count` with SSE 4.2 and AVX2 #202
Conversation
Interesting! I'm wondering why the existing implementation looks the way it does (Lines 1001 to 1013 and Lines 1085 to 1089 in 95fe6bd). c29225b mentions using … It would be nice if you could share your benchmark, but I guess that's somewhat blocked on #196. #109 also looks related.
My benchmark is something really silly, along the lines of

```haskell
import qualified Data.ByteString as BS
import System.IO.Posix.MMap

main :: IO ()
main = do
  contents <- unsafeMMapFile "/huge/file/on/tmpfs.txt"
  print $ BS.count 10 contents
```

The file size is about 2 gigs, and it's on a tmpfs. Thanks for the pointer to #109, I missed it! While I'm at this task I can just as well try to brush up my SIMD intrinsics skills and also implement a version using AVX2. In case I get decent results, what's the preferred way of integrating this? With

```c
__attribute__((target("sse4.2")))
void foo() { ... }

__attribute__((target("avx")))
void foo() { ... }

/* fallback */
void foo() { ... }
```

gcc can generate some dispatching code to select the right implementation at runtime according to the current (runtime) CPU architecture.
Alright, did some experimenting with SSE 4.2.

So I guess if every other character would be the character in question, the numbers might look different. My gut feel is that something along these lines is a reasonable starting point:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

__attribute__((target("sse4.2")))
unsigned long fps_count(unsigned char *str, unsigned long len, unsigned char w) {
    __m128i pat = _mm_set1_epi8(w);
    const int mode = _SIDD_SBYTE_OPS | _SIDD_CMP_EQUAL_EACH;
    unsigned long res = 0;
    size_t i = 0;
    /* scalar prologue until str + i is 64-byte aligned */
    for (; i < len && (intptr_t)(str + i) % 64; ++i) {
        res += str[i] == w;
    }
    /* i + 64 <= len avoids unsigned underflow when len < 64 */
    for (; i + 64 <= len; i += 64) {
        __m128i p1 = _mm_load_si128((const __m128i*)(str + i + 16 * 0));
        __m128i p2 = _mm_load_si128((const __m128i*)(str + i + 16 * 1));
        __m128i p3 = _mm_load_si128((const __m128i*)(str + i + 16 * 2));
        __m128i p4 = _mm_load_si128((const __m128i*)(str + i + 16 * 3));
        __m128i r1 = _mm_cmpestrm(p1, 16, pat, 16, mode);
        __m128i r2 = _mm_cmpestrm(p2, 16, pat, 16, mode);
        __m128i r3 = _mm_cmpestrm(p3, 16, pat, 16, mode);
        __m128i r4 = _mm_cmpestrm(p4, 16, pat, 16, mode);
        res += _popcnt64(_mm_extract_epi64(r1, 0));
        res += _popcnt64(_mm_extract_epi64(r2, 0));
        res += _popcnt64(_mm_extract_epi64(r3, 0));
        res += _popcnt64(_mm_extract_epi64(r4, 0));
    }
    /* scalar epilogue for the remaining tail */
    for (; i < len; ++i) {
        res += str[i] == w;
    }
    return res;
}
```

Let's probably start discussing how SIMD-extension-dependent things should be integrated?
Wow, those results look quite impressive!

Indeed – I can't be of much help there though! :) @0xd34df00d You can also try garnering more feedback on the libraries mailing list.
@sjakobi I spent some more time at this, and AVX2 looks even better (see the table at https://github.com/0xd34df00d/counting-chars)! I made some effort to make sure this still works on machines that don't have AVX2 (or even SSE4.2), but I'm not sure about the treatment of compilers other than gcc and clang, so ideally another pair of eyes better suited for this sort of stuff than mine would take a look at this.
@0xd34df00d this is a very inspiring result! Could you please raise a question about SIMD policies on the Libraries mailing list?
THIS IS REALLY COOL, DEADBEEF, though of course I need to read it much more closely :)

For what it's worth, there are no SIMD optimization guidelines for the cbits parts of Haskell libraries, and the main bottleneck/engineering challenge is having the right dynamic code selection piece, which this PR seems to do! One important question is always: what's the overhead of the dynamic dispatch on smaller inputs? (E.g., is it better to only do the SIMD when inputs are above some size threshold?) Plus stuff about which CPU generations actually benefit (which is tricky!)
```c
    return fps_count_naive(str, len, w);
#else
    if (len <= 1024) {
        return fps_count_naive(str, len, w);
```
Oh sweet, it does do the fallback for small inputs :)
Your dynamic dispatch looks correct in isolation, but if I'm reading the gcc/clang docs about function multi-versioning right, I think you're either using it wrong or underusing it. Or I need to read/play with those more and understand the tricks myself. Reference URL: https://gcc.gnu.org/onlinedocs/gcc-10.1.0/gcc/Function-Multiversioning.html#Function-Multiversioning — to be clear, my fuzzy understanding is that if the file was compiled in C++ mode, then the different attributes with the same function name would do an auto-select? (And here you're doing a dynamic dispatch via C either way, so they have deliberately different names.) Also, clang apparently has different syntax for the attribute stuff. To be clear, I literally learned this was a feature today :), so that's great!
Derp, https://releases.llvm.org/7.0.0/tools/clang/docs/AttributeReference.html#target-gnu-target is the clang analogue?
Yay, thank you @cartazio for taking a look at this! Regarding the dynamic dispatch: yes, gcc surely can and will generate the dispatching code from these attributes. clang, on the other hand, doesn't seem to always do this; at least, in my experience I had to manually write similar dispatching code about 3 years ago. Maybe things have gotten better since then! Of course, I'm all for dropping the manual detection and dispatch (the less code, the better), but I'm not entirely comfortable doing so given the experience above. Also, speaking of the overhead of dynamic dispatch: in any case it only happens once per OS/C thread (or whatever the right Haskell lingo for this is), and after that it's a matter of a single comparison of a thread-local function pointer against 0, which a CPU's branch predictor will hopefully learn very quickly. My primary reason for using the naive function for smaller inputs is to avoid all the overhead of the unaligned prologue/epilogue handlers of the SIMD versions.
No, no, it looks fine, I'm just not often given the experience of seeing well-written, portable SIMD dispatching. I think you don't need the attribute annotation if you're doing the manual dispatch? OTOH, I literally have zero experience with this stuff, so not sure :). I think the naive fallback for smaller inputs is the right choice though, so that seems fine to me.
Let's make sure the build goes green and this code works with both recent gcc and clang! :)
A good compatibility test would be to build this on GHC's CI infrastructure. I've meant to properly document this for a while, but essentially you …
@0xd34df00d some build errors you should fix ;)

I think the build is erroring there, but idk :)
That depends on the compiler flags. Re the build error, that's interesting, I'll take a look! Might be a different standard library with different intrinsics exposed. Are you building on a Mac by any chance? @sjakobi, thanks for the instructions, will do!
@0xd34df00d Travis build complains about undefined …
@0xd34df00d are you still interested in this PR?
These are probably solvable issues, but this amount of conditional compilation makes me uncomfortable to proceed. We are not in a position to test all the affected configurations. @vdukhovni what's your take?
@Bodigrim sorry for the long reply, this completely fell out of my attention span. I'm definitely still interested! Frankly, neither am I comfortable with the number of platforms that I cannot test this on (especially for such an important and foundational library). Short-term, it seems like the best way is to have a cabal flag controlling whether the hand-written SIMD implementation is used for this function (and possibly others), disabled by default. What I'd like to see as an ideal long-term solution is this function implemented purely in terms of GHC's SIMD primops. In fact, I tried that, but some things are missing from the set of available primops (namely, there's an …).
Besides usability concerns (switching a flag of …), supporting an additional cabal flag is very costly from a maintenance perspective: more configurations to test, more potential build failures, more support. I'm not opposed to SIMD in `bytestring` …
Ok, implemented this with C11's `call_once`.
Well, C11 `threads.h` is an optional feature. On MacOS "Big Sur", there is no `threads.h`.
I guess this means POSIX's `pthread_once` then?
Well, `pthread_once` is an option on Unix-like systems, assuming explicit linking with `-lpthread` is acceptable. My opinion remains that these sorts of primitives belong in GHC itself as new primops, and not so much in `bytestring`.
Well, I've heard talk about vectorized operations in GHC for years without much fruition. Even if a champion suddenly emerged to push this stuff forward, it would still take a long time before `bytestring` could rely on it. My personal opinion is that vectorized operations will not find their way into GHC primitives until a critical mass of libraries employing them accumulates. Then it would be easy to make a case in their favor.
Turns out a conforming C implementation might be running on a platform with threads, but with no atomics in the C stdlib.
@vdukhovni I skimmed superficially through what GHC has on SIMD primops some time ago, and, while I more than agree with your sentiment for multiple reasons, I see several obstacles (in fact, I tried writing the above with primops, and stumbled upon a few missing ones). Firstly, a more technical one: if I understand correctly, GHC currently only supports SIMD primops when compiling via LLVM. Secondly, a more conceptual one: I'm not sure how to expose all the interesting instructions in an architecture-agnostic way. The usual things like vector addition are easy: they're present in pretty much any SIMD extension set with the same semantics. But if we start to consider things like cmpestrm, that becomes entirely non-obvious. One way to solve the latter is by allowing the user to define their own primops, which might look like

```haskell
cmpestrm# :: Word8x16# -> Word8x16# -> Imm8# -> Word8x16#
cmpestrm# s1 s2 imm8 = [ghcMagic| pcmpestrm s1 s2 imm8 |]
```

or something along those lines (where `ghcMagic` …). Although I'm not sure how deep the can of worms it opens goes: in fact, the above definition should also accept lengths that shall be stored in …
Benchmarks look impressive, 70–85% faster for long strings!
What I had in mind is not so much primops for non-portable low-level instructions, but rather direct support for higher-level primitives (like memchr(), ...) in GHC via appropriate low-level primitives when available, or naive loops otherwise. These should work not only for ByteString but also for ByteArray…
I agree with deadbeef, good SIMD API design is out of reach for GHC any time soon.
Thanks, @0xd34df00d!
* Implement SSE4.2 and AVX2-based variants of `count`
* Address PR comments wrt SIMD-based character counting
* Add test for `count` on longer strings
* Add benchmarks for strict bytestring's `count` method
* Fix tabs
* Use C11's `call_once` instead of TLS for choosing the counting SIMD function
* Build cbits with -std=c11
* Add a point to `count` benchmarks slightly above the SIMD threshold
* Initialize the `count` SIMD ptr with atomics
* Enable `count` SIMD support only if atomics are available: turns out a conforming C implementation might be running on a platform with threads, but with no atomics in the C stdlib
I wonder about the performance of `count` combined with `par`, with the 1GB file split in half and processed in parallel. 350ms is a long time for my other cores to be doing nothing. 🤔
So if the bytestring length is greater than, say, 100_000 or 10_000 or something, we should call a safe FFI version?
Wouldn't that conflict with the push to use unpinned memory?
Nope. It's safe precisely because it's pinned!

Alternatively, we could split it into a sequence of slices that yield back to Haskell, calling per slice.

My personal rule of thumb is to make sure FFI things that take more than 10 microseconds get split up to play nicer with the scheduler.
My impression was that "unsafe" calls are not preempted by the GC and don't necessarily need pinned memory, while "safe" calls do. Since there are proposals in flight to move ByteString to unpinned ByteArray, I'd expect those to be less amenable for use with "safe" calls.
Sorry for not being very active these days (the first baby at home is taking quite some time). I don't have much to add anyway. Good work! I also wish we had a better story for SIMD primops in GHC, but we don't even have a good story for sub-word types (Word8#, Word16#, Word32#, ...) yet, so it will take some time.
I think bytestring won't switch to unconditionally unpinned memory, but rather have pinned and unpinned flavors. (It's a deeply crash-prone breaking change to silently make them unpinned. May as well be new modules or a new package at that point :))
This provides a fairly significant performance boost: counting lines via `BS.count 10` was taking about 1.1 s with an `mmap`ed 1.8 gigabyte file, and now it takes 0.35 s. For comparison, `length . BS.lines` takes about 0.4 s (IMO `count` should definitely be at least as fast), and Unix `wc` takes about 0.3–0.4 s.