Fractal example: poor scalability and unexpected number of CPUs #17
Comments
As @facontidavide points out in google#17, some standard libraries have a mutex lock in rand() that can dramatically hurt performance when run across multiple threads. Instead do subsampling in a regular n x m pattern, removing the need to call `rand()` at all. This also has the benefit of making the output deterministic.
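A minimal sketch of what regular n x m subsampling could look like, for illustration only. The function name `gridOffsets` and the choice to place each sample at the centre of its grid cell are assumptions, not taken from the PR itself. The point is that the sample positions depend only on `n` and `m`, so no call to `rand()` is needed and every run produces the same output.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not from the marl example): returns n*m sub-pixel
// sample offsets in [0, 1) x [0, 1), laid out on a regular grid with each
// sample at the centre of its cell. No random numbers are involved.
std::vector<std::pair<float, float>> gridOffsets(int n, int m) {
  std::vector<std::pair<float, float>> offsets;
  offsets.reserve(static_cast<std::size_t>(n) * static_cast<std::size_t>(m));
  for (int j = 0; j < m; j++) {    // rows of the sub-pixel grid
    for (int i = 0; i < n; i++) {  // columns of the sub-pixel grid
      offsets.emplace_back((i + 0.5f) / n, (j + 0.5f) / m);
    }
  }
  return offsets;
}
```

A pixel at `(x, y)` would then be sampled at `(x + dx, y + dy)` for each returned `(dx, dy)`, and the results averaged.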
Hi @facontidavide, thank you for reporting this. This is an interesting problem that doesn't seem to affect the standard library shipped with the macOS Xcode toolchain (which is where I wrote and tested this example). This is what I see:
That said, I've put a PR together that completely eliminates the use of
I expect to see 2 hardware threads created as See explanation below. Cheers!
Regardless of whether we go with #19 or #18, do both of these fix the lack-of-expected-performance-gain issue you originally described, @facontidavide?
Yes, both work on my computer (Ubuntu 16.04). By "work" I mean that I observe scalability qualitatively similar to yours. #18 should be preferred because it is actually faster and, as you mentioned, deterministic.
Are you sure about those numbers?
If I disable all marl parallelization work I get:
Which is in line with what I'd expect.
The reason we see 2 CPUs working at 100% is that the main thread, once it blocks waiting for the work to finish, "joins in the party" and starts processing tasks itself. That is why the number of busy CPUs is the number of worker threads plus one.
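The "joins in the party" behaviour can be illustrated with a tiny stand-in pool written with the standard library only. This is a sketch under stated assumptions, not marl's implementation: `TinyPool` and `runAndCountThreads` are hypothetical names, and real marl uses fibers rather than a plain locked queue. It only shows why a caller that helps instead of idling makes up to `numWorkers + 1` threads busy.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical miniature task pool (NOT marl's implementation).
class TinyPool {
 public:
  // Queues a task for later execution.
  void post(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push(std::move(task));
  }

  // Runs queued tasks on the calling thread until the queue is empty.
  void drain() {
    for (;;) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // executed outside the lock
    }
  }

 private:
  std::mutex mutex_;
  std::queue<std::function<void()>> tasks_;
};

// Runs numTasks tasks on numWorkers worker threads plus the calling thread,
// and returns how many distinct threads executed at least one task.
std::size_t runAndCountThreads(int numWorkers, int numTasks) {
  TinyPool pool;
  std::mutex idsMutex;
  std::set<std::thread::id> ids;
  for (int i = 0; i < numTasks; i++) {
    pool.post([&] {
      std::lock_guard<std::mutex> lock(idsMutex);
      ids.insert(std::this_thread::get_id());
    });
  }
  std::vector<std::thread> workers;
  for (int i = 0; i < numWorkers; i++) {
    workers.emplace_back([&] { pool.drain(); });
  }
  pool.drain();  // the caller helps with the work instead of just waiting
  for (auto& w : workers) w.join();
  return ids.size();
}
```

With `numWorkers` set to 1, the result can be as high as 2, which matches the "argument 1 uses 2 CPUs" observation. The exact count varies from run to run because it depends on which threads win the race for tasks.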
Hi,
I have been playing around with the only example and I was quite surprised by its poor performance.
First of all, I modified the code to run in a purely sequential way (no marl). This is the time required by a single CPU:
Afterward, I modified the code as follows:
Running with argument "1" I expected one CPU to be used, but actually 2 CPUs are used (100% each).
Argument "2" uses 3 CPUs apparently
Argument "4" uses 5 CPUs apparently (I have 8)
So, basically, the only example provided so far seems to suggest that marl kind of... disappoints, with most of its time spent "somewhere" in a blocking operation.
After some profiling, I have the feeling that the fault is a mutex in the function rand(). See attached flamegraph.
My suggestion is to either fix this (I don't know how) or provide a different example where we can actually say: "hey, look how scalable it is with the number of CPUs"!
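For reference, one common way to sidestep a lock inside `rand()` is to give each thread its own generator, so no state is shared between threads. The sketch below is an illustration, not what the example currently does; the function name `threadLocalRandom01` is hypothetical.

```cpp
#include <random>

// Hypothetical helper: returns a float in [0, 1) from a generator owned by
// the calling thread, so concurrent callers never contend on shared state.
float threadLocalRandom01() {
  // Each thread lazily constructs and seeds its own engine on first use.
  thread_local std::mt19937 engine{std::random_device{}()};
  // Keep the top 24 bits of the 32-bit output and scale by 2^-24. Every
  // such value is exactly representable as a float and is strictly < 1.
  return static_cast<float>(engine() >> 8) * (1.0f / 16777216.0f);
}
```

Unlike a fixed sampling pattern, this keeps the output non-deterministic across runs unless each thread's engine is seeded with a fixed value.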
I am also puzzled by the fact that the number of CPUs used is always equal to (num_threads + 1).