This is a bit of a mixed bag of a commit, so I totally understand if you don't want to merge it as-is. I figured I'd let you choose what you like, if anything, instead of just forking silently.
First, thank you for writing the cross-platform + lightweight semaphore! I had had in mind to do something remarkably similar for absolute ages, but just never had the time to get around to it. Now I don't have to :-) Your blog post, too, was high-quality as usual. I've taken the lightweight semaphore and used it to implement a blocking wrapper around both of my lock-free queues (spsc and mpmc).
The changes I made are:
Your unit tests still pass. I can't tell if there's any effect on performance since the tests seem to take a wildly different amount of time to execute each time on my PC (which admittedly is a rather wimpy AMD C-50 dual-core netbook processor -- I had to lower the iteration count temporarily on some of them otherwise it would have taken over half an hour to get the test results, heh).
Thank you for your hard work!
Various tweaks to semaphore, including in particular support for bulk…
… waiting. Also fixed 'Unix Makefiles' generation when on Windows/MinGW64.
FYI, sizeof(int) != sizeof(ssize_t) in 64-bit environment: in AMD64, it will issue more lengthy instructions (64-bit operation on register needs an instruction prefix while 32-bit operation doesn't) for no big deal (I won't expect for more than 2 billion threads to wait !).
@hlide: Yes, that was my intention. I didn't know the instruction length was different, though -- does it make any difference to performance? EDIT: It does not. Both std::atomic<int> and std::atomic<long long> take identical amounts of time to do a FAA and a CAS, respectively, at least on my 64-bit Intel Core 2 processor. See my benchmark here.
In any case, it's not for the number of waiting threads (when the count is negative), but rather for the number of available items (when the count is positive).
@cameron314: as long as the instruction cache can absorb those extra bytes, it won't. Suppose that all the code of a big program was using 'int' and was fine in 64-bit, then you switch 'int' into 'ssize_t', you may end up with a bigger binary size which is not that negligible. The same for data size too (which is as twice as it was so data cache will certainly suffer because a cache line will only handle the half of what it could before).
MinGW support (from cameron314 in #3)
Thanks for the MinGW fix. I'm half tempted by the <windows.h> trick, but I'll leave it alone for now.
@cameron314: off-topic I'm quite interested with your free-lock SPSC and MPMC queues implementations, but I have a question. I need a free-lock MPSC queue and I was wondering how to achieve it (blocking version or not).
Ah, yes, I hadn't considered the cacheline implications. Interesting.
@hlide: Hmm, MPSC. I don't know of any off the top of my head -- you're probably best off picking a general MPMC queue and just using that. Mine might not offer the best characteristics for that scenario, as it's built out of SPMC queues internally. The best way is to try a few and find the one that's fastest for your particular workload (by profiling). Just know that both SPMC and MPSC can be engineered much more efficiently than MPMC.