
Fractal example: poor scalability and unexpected number of CPUs #17

Closed
facontidavide opened this issue Sep 9, 2019 · 7 comments

@facontidavide

Hi,

I have been playing around with the only example provided and was quite surprised by its poor performance.

First of all, I modified the code to run in a purely sequential way (no marl). This is the time required by a single CPU:

real	0m0.025s
user	0m0.025s
sys	0m0.000s

Afterward, I modified the code as follows:

int main(int argc, const char** argv) {
  marl::Scheduler scheduler;
  uint32_t num_threads = atoi(argv[1]);
  scheduler.setWorkerThreadCount(num_threads);

Running with argument "1" I expected one CPU to be used, but actually 2 CPUs are used (100% each).

real	0m18.354s
user	0m32.423s
sys	0m3.972s

Argument "2" uses 3 CPUs apparently

real	0m16.790s
user	0m39.521s
sys	0m10.168s

Argument "4" uses 5 CPUs apparently (I have 8)

real	0m15.852s
user	0m49.900s
sys	0m25.362s

So, basically, the only example provided so far suggests that marl kind of... disappoints: most of the time is spent blocked "somewhere".

After some profiling, I have the feeling that the culprit is a mutex inside the function rand(). See the attached flamegraph.

[flamegraph image]

My suggestion is to either fix this (I don't know how) or provide a different example where we can actually say: "hey, look how scalable it is with the number of CPUs"!
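One possible fix, sketched under my own assumptions (this is my illustration, not marl's code): since some C library implementations guard rand()'s single global state with a mutex, keeping a PRNG per thread via `thread_local` removes the shared state and therefore the lock contention entirely.

```cpp
#include <cstdlib>
#include <random>

// Hypothetical lock-free replacement for rand(): each thread owns its own
// generator, so concurrent callers never touch shared state and never block.
inline int threadLocalRand() {
  thread_local std::mt19937 rng{std::random_device{}()};
  thread_local std::uniform_int_distribution<int> dist(0, RAND_MAX);
  return dist(rng);
}
```

Each thread pays a one-time seeding cost on first use, after which every call is a plain local computation.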

I am also puzzled by the fact that the number of CPUs used is always equal to (num_threads + 1).

@facontidavide facontidavide changed the title Fractal example: poor sclability and unexpected number of CPU Fractal example: poor scalability and unexpected number of CPUs Sep 9, 2019

ben-clayton added a commit to ben-clayton/marl that referenced this issue Sep 9, 2019
As @facontidavide points out in google#17, some standard libraries have a mutex lock in rand() that can dramatically hurt performance when run across multiple threads.

Instead do subsampling in a regular n x m pattern, removing the need to call `rand()` at all. This also has the benefit of making the output deterministic.
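The idea in the commit message can be sketched roughly as follows (function name and signature are mine, not from the actual change): place the sub-samples on a regular n x m grid inside each pixel, which needs no RNG calls at all and always produces the same offsets.

```cpp
#include <utility>
#include <vector>

// Hypothetical sketch of regular grid subsampling: returns n*m sample
// offsets in [0,1) x [0,1), one at the center of each grid cell.
// Deterministic by construction, and no shared RNG state to contend on.
std::vector<std::pair<float, float>> gridSubsamples(int n, int m) {
  std::vector<std::pair<float, float>> out;
  out.reserve(static_cast<std::size_t>(n) * m);
  for (int j = 0; j < m; j++)
    for (int i = 0; i < n; i++)
      out.emplace_back((i + 0.5f) / n, (j + 0.5f) / m);  // cell centers
  return out;
}
```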
@ben-clayton
Contributor

ben-clayton commented Sep 9, 2019

Hi @facontidavide,

Thank you for reporting this. This is an interesting problem that doesn't seem to affect the standard library found with the macOS Xcode toolchain (which is where I wrote and tested this example). This is what I see:

setWorkerThreadCount    Time taken
0                       0m14.397s
1                       0m7.483s
2                       0m5.389s
3                       0m4.235s
4                       0m3.517s
5                       0m3.173s
6                       0m2.893s

That said, I've put a PR together that completely eliminates the use of rand() in favour of regular grid subsampling. (I also see you've put together a change; I'll take a look in a sec.)

> Running with argument "1" I expected one CPU to be used, but actually 2 CPUs are used (100% each).

I expect to see 2 hardware threads created, as setWorkerThreadCount() specifies the number of dedicated worker threads to be spawned (main thread + dedicated worker = 2). Seeing the main thread churn at 100% might be due to the main thread being starved of work at wg.wait() and spinning in Scheduler::Worker::spinForWork(). This spin-down should only last 1ms though, so I'll take a look at my end and see if I can reproduce.

See explanation below.

Cheers!
Ben

@ben-clayton
Copy link
Contributor

Regardless of whether we go with #19 or #18, do both of these fix the lack-of-expected-performance-gain issue you originally described, @facontidavide?

@facontidavide
Author

facontidavide commented Sep 9, 2019

Yes, both work on my computer (Ubuntu 16.04). By "work" I mean that I can observe scalability qualitatively similar to yours.

#18 should be preferred because it is actually faster and, as you mentioned, deterministic.

@ben-clayton
Contributor

> First of all I modified the code to run in a purely sequential way (no marl). This is the time required by a single CPU:

Are you sure about those numbers?

@ben-clayton
Contributor

If I disable all marl parallelization work I get:

real    0m15.314s
user    0m15.272s
sys     0m0.028s

Which is in line with what I'd expect.

@ben-clayton
Contributor

The reason we see 2 CPUs working at 100% is that the main thread has become entirely blocked, and so it "joins the party" and starts processing tasks itself.
Note: This is the standard behaviour of task execution when there are no dedicated worker threads (setWorkerThreadCount(0)) - it will only start executing tasks on a non-dedicated thread when there is nothing else for it to do.
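To illustrate the mechanism with a toy sketch (not marl's actual implementation, all names are mine): if wait() drains any pending tasks on the calling thread instead of idling, the "blocked" caller shows up as one more fully busy CPU.

```cpp
#include <deque>
#include <functional>
#include <mutex>

// Toy model of a scheduler whose waiting thread joins in: wait() pops and
// runs queued tasks on the calling thread until the queue is empty, so the
// caller is never actually idle while work remains.
class ToyScheduler {
 public:
  void schedule(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push_back(std::move(task));
  }

  // Called by the "blocked" thread; it executes the remaining work itself.
  void wait() {
    for (;;) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop_front();
      }
      task();  // runs on the waiting thread
    }
  }

 private:
  std::mutex mutex_;
  std::deque<std::function<void()>> tasks_;
};
```

This is also why the observed CPU count is num_threads + 1: the dedicated workers plus the main thread doing work while it waits.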

ben-clayton added a commit that referenced this issue Sep 9, 2019
As @facontidavide points out in #17, some standard libraries have a mutex lock in rand() that can dramatically hurt performance when run across multiple threads.

Instead do subsampling in a regular n x m pattern, removing the need to call `rand()` at all. This also has the benefit of making the output deterministic.