Very bad CounterActor benchmark performance on a manycore machine #38
Yes, this is unfortunately to be expected for that benchmark. That particular benchmark basically measures a kind of pathological parallel program design. In that scenario, there is a single sequential bottleneck, the "counter actor", with which all CPU cores attempt to communicate under FIFO semantics. When I say this is a "pathological design" I really mean it. You should avoid such bottlenecks when designing parallel programs, because such bottlenecks cannot be made to scale. At best such a design can be made to degrade gracefully as the number of cores is increased.

Now, with that said, it would be possible to change the Hopac implementation to perform better, in other words, to not degrade as quickly as it currently does in that scenario. However, doing so would impact most other benchmarks negatively, and most of those benchmarks actually correspond to more viable program designs. The following items are strongly related:
I would recommend reading through them. Basically, when implementing low-level parallel/concurrent primitives, such as spinlocks, there is often a choice between scalable algorithms (e.g. MCS, CLH) that degrade gracefully in case of high contention, but have a significant constant overhead in low-contention scenarios, and non-scalable algorithms (e.g. TTAS) that perform well in low-contention scenarios, but have very poor (asymptotically worse) performance in high-contention scenarios (e.g. O(n^2)).

To put this in more concrete terms: in a low-contention scenario, with a non-scalable (T)TAS spinlock one may need to perform just a single cache-line transfer to take the lock, read and update the protected data, and release the lock. Scalable spinlocks will need at least two, and often a few more, cache-line transfers. In high-contention scenarios those extra transfers ensure that cores will not keep competing for one and the same cache line, but in low-contention scenarios those transfers amount to 2x or higher overhead.

Most Hopac primitives have been specifically optimized for low-contention scenarios. There are a number of justifications for this. The most important one is that low contention should be the expected scenario, so it makes sense to optimize for that case. Why should low contention be the expected scenario? Because that is the way parallel programs should be designed in the first place. High-contention designs cannot scale; at best they can degrade gracefully. BTW, one low-barrier explanation of this scaling issue, and of a more scalable design approach, is the section "A Case Study on Concurrency-Oriented Programming" in chapter 4 of the book "Erlang Programming" by Cesarini and Thompson.
Another, related, justification for optimizing low-level primitives for low contention rather than for high contention is that it is often possible to reduce contention at a higher level, while, conversely, given low-level primitives with high overhead, one cannot eliminate such overheads at a higher level. Many of the basic approaches (e.g. fan out & randomize) that can be used in low-level primitives to improve scalability can basically also be used at a higher level.

I've already explored many more scalable algorithms while developing Hopac, and unfortunately I've yet to find .Net-implementable low-level primitives that would degrade more gracefully and still perform well in low-contention scenarios. (Ideally, of course, the .Net and Mono built-in Monitor would perform like a champ, but that is not the case.) I do have some ideas for further experiments. For example, a CLH-style algorithm with manually implemented memory management might lead to a lock that would perform reasonably well at low contention and would also degrade gracefully. Some other approaches might also work. At the moment, however, these are not very high priority.

Ultimately I think that the better approach to dealing with this issue is not to try to optimize the current channels for high-contention scenarios, but to provide a separate channel type for high-contention scenarios. Such a channel type, or channel types, might even have slightly different semantics, making it possible to do optimizations that would be illegal for the regular channels. If you happen to run into a case where high contention cannot easily be avoided, we could look into implementing that.

BTW, one thing that is also useful to understand is what constitutes high contention. Nearly all places in Hopac that effectively take locks do so for only the duration of a few instructions, so the "scope" for contention is very small. Let's assume that a remote DRAM access takes 100 ns. This is the same as 10 million accesses per second.
From this one could estimate that if you have a channel that would be (if it could be) accessed 10 million or more times per second (from multiple cores), or in bursts where individual access attempts are not more than 100 ns apart, then you definitely have a high-contention scenario and performance will degrade. On the other hand, one could estimate that if individual accesses are well more than 100 ns apart, say 1 µs (1000 ns) apart, or roughly 1 million accesses per second (per channel, when accessed from multiple cores), then you are unlikely to suffer from high contention. Here is a partial summary:
| Algorithm | Low contention | High contention |
| --- | --- | --- |
| non-scalable (e.g. TTAS) | fast (single cache-line transfer) | degrades badly (e.g. O(n^2)) |
| scalable (e.g. MCS, CLH) | 2x or higher overhead | degrades gracefully |
Thanks! So Hopac is optimized for scenarios where many processes (jobs) communicate via channels / ivars / mvars / etc., with no single place accessed simultaneously by many threads that would become a bottleneck. I feel I should read a lot to fully understand what you are talking about, though :) I started to read "The Art of Multiprocessor Programming" and will read that chapter from the Erlang book.
Perhaps one could say that Hopac components are optimized for size rather than for speed, so you can have millions of channels and jobs rather than just a few channels and threads. Note that contention only becomes an issue when a channel is accessed by multiple parallel jobs millions of times per second (or in long enough high-frequency bursts). For most channels this is simply not going to be the case. The cases where that seems to be, or is, necessary need either to be reworked (always better, if possible) or to use a different channel type (and that can only go so far, because no matter what, there are hard limits on how many accesses from multiple cores a single FIFO channel can handle per second).
But, indeed, if you have an actual application where a channel becomes a bottleneck due to high contention, then that would certainly be good motivation to implement a channel type that deals better with contention.
Core i5 (4 physical cores), Windows 7 x64: [benchmark output attached as image]
Xeon (24 physical cores), Windows Server 2012R2 x64: [benchmark output attached as image]
CPU usage - Hopac: [screenshot]
CPU usage - MB: [screenshot]
An older Xeon (8 cores), Windows Server 2012R2 x64: [benchmark output attached as image]