Very bad CounterActor benchmark performance on a manycore machine #38
Yes, this is unfortunately to be expected for that benchmark. That particular benchmark basically measures a kind of pathological parallel program design. In that scenario, there is a single sequential bottleneck, the "counter actor", with which all CPU cores attempt to communicate under FIFO semantics. When I say this is a "pathological design" I really mean it. You should avoid such bottlenecks when designing parallel programs, because such bottlenecks cannot be made to scale. At best such a design can be made to degrade gracefully as the number of cores is increased.

Now, with that said, it would be possible to change the Hopac implementation to perform better, in other words, to not degrade as quickly as it currently does in that scenario. However, doing so would impact most other benchmarks negatively, and most of those benchmarks actually correspond to more viable program designs. The following items are strongly related:
I would recommend reading through them. Basically, when implementing low-level parallel/concurrent primitives, such as spinlocks, there is often a choice between scalable algorithms (e.g. MCS, CLH) that degrade gracefully in case of high contention, but have a significant constant overhead in low-contention scenarios, and non-scalable algorithms (e.g. TTAS) that perform well in low-contention scenarios, but have very poor (asymptotically worse) performance in high-contention scenarios (e.g. O(n^2)).

To put this in more concrete terms: in a low-contention scenario, with a non-scalable (T)TAS spinlock one may need to perform just a single cache-line transfer to take the lock, read and update the protected data, and release the lock. Scalable spinlocks will need at least two, and often a few more, cache-line transfers. In high-contention scenarios those extra transfers ensure that cores will not keep competing for one and the same cache line, but in low-contention scenarios those transfers amount to 2x or higher overhead.

Most Hopac primitives have been specifically optimized for low-contention scenarios. There are a number of justifications for this. The most important one is that low contention should be the expected scenario, so it makes sense to optimize for that case. Why should low contention be the expected scenario? Because that is the way parallel programs should be designed in the first place. High-contention designs cannot scale; at best they can degrade gracefully. BTW, one low-barrier explanation of this scaling issue, and of a more scalable design approach, is the section "A Case Study on Concurrency-Oriented Programming" in chapter 4 of the book "Erlang Programming" by Cesarini and Thompson.
Another, related, justification for optimizing low-level primitives for low contention rather than for high contention is that it is often possible to reduce contention at a higher level, while, conversely, given low-level primitives with high overhead, one cannot eliminate such overheads at a higher level. Many of the basic approaches (e.g. fan out & randomize) that can be used in low-level primitives to improve scalability can basically also be used at a higher level.

I've already explored many more scalable algorithms while developing Hopac, and unfortunately I've yet to find .Net-implementable low-level primitives that would degrade more gracefully and still perform well in low-contention scenarios. (Ideally, of course, the .Net and Mono built-in Monitor would perform like a champ, but that is not the case.) I do have some ideas for further experiments. For example, a CLH-style algorithm with manually implemented memory management might lead to a lock that would perform reasonably well at low contention and would also degrade gracefully. Some other approaches might also work. At the moment, however, these are not very high priority.

Ultimately I think that the better approach to dealing with this issue is not to try to optimize the current channels for high-contention scenarios, but to provide a separate channel type for high-contention scenarios. Such a channel type, or channel types, might even have slightly different semantics, making it possible to do optimizations that would be illegal for the regular channels. If you happen to run into a case where high contention cannot easily be avoided, we could look into implementing that.

BTW, one thing that is also useful to understand is what constitutes high contention. Nearly all places in Hopac that effectively take locks do so for only the duration of a few instructions, so the "scope" for contention is very small. Let's assume that a remote DRAM access takes 100 ns. This is the same as 10 million accesses per second.
From this one could estimate that if you have a channel that would be (if it could be) accessed 10 million or more times per second (from multiple cores), or in bursts where individual access attempts are not more than 100 ns apart, then you definitely have a high-contention scenario and performance will degrade. On the other hand, one could estimate that if individual accesses are well more than 100 ns apart, say 1 µs (1000 ns) apart, or roughly 1 million accesses per second (per channel, when accessed from multiple cores), then you are unlikely to suffer from high contention. Here is a partial summary:
| Algorithm | Low contention | High contention |
| --- | --- | --- |
| non-scalable (e.g. TTAS) | fast (single cache-line transfer) | degrades badly (e.g. O(n^2)) |
| scalable (e.g. MCS, CLH) | 2x or higher overhead | degrades gracefully |
Thanks! So Hopac is optimized for scenarios where many processes (jobs) communicate via channels / ivars / mvars / etc., with no single place accessed simultaneously by many threads that would become a bottleneck. I feel I should read a lot to fully understand what you are talking about, though :) I started to read "The Art of Multiprocessor Programming" and will read that chapter from the Erlang book.
Perhaps one could say that Hopac components are optimized for size rather than for speed, so you can have millions of channels and jobs rather than just a few channels and threads. Note that contention only becomes an issue when a channel is accessed by multiple parallel jobs millions of times per second (or in long enough high-frequency bursts). For most channels this is simply not going to be the case. The cases where that seems to be, or is, necessary need either to be reworked (always better, if possible) or to use a different channel type (and that can only go so far, because no matter what, there are hard limits on how many accesses from multiple cores a single FIFO channel can handle per second).
But, indeed, if you have an actual application where a channel becomes a bottleneck due to high contention, then that would certainly be good motivation to implement a channel type that deals better with contention.
Core i5 (4 physical cores), Windows 7 x64: [benchmark output attached as image]
Xeon (24 physical cores), Windows Server 2012R2 x64: [benchmark output attached as image]
CPU usage - Hopac: [screenshot]
CPU usage - MB: [screenshot]
An older Xeon (8 cores), Windows Server 2012R2 x64: [benchmark output attached as image]