
Fractal example: poor scalability and unexpected number of CPUs #17

Closed
facontidavide opened this issue Sep 9, 2019 · 7 comments

@facontidavide

Hi,

I have been playing around with the only example provided and was quite surprised by its poor performance.

First of all, I modified the code to run in a purely sequential way (no marl). This is the time required by a single CPU:

real	0m0.025s
user	0m0.025s
sys	0m0.000s

Afterward, I modified the code as follows:

int main(int argc, const char** argv) {
  marl::Scheduler scheduler;
  uint32_t num_threads = atoi(argv[1]);
  scheduler.setWorkerThreadCount(num_threads);

Running with argument "1" I expected one CPU to be used, but actually 2 CPUs are used (100% each).

real	0m18.354s
user	0m32.423s
sys	0m3.972s

Argument "2" uses 3 CPUs apparently

real	0m16.790s
user	0m39.521s
sys	0m10.168s

Argument "4" uses 5 CPUs apparently (I have 8)

real	0m15.852s
user	0m49.900s
sys	0m25.362s

So, basically, the only example provided so far suggests that marl kind of... disappoints: most of the time is spent blocked "somewhere".

After some profiling, I have the feeling that the culprit is a mutex inside the function rand(). See the attached flamegraph.

[flamegraph image]

My suggestion is to either fix this (I don't know how) or provide a different example where we can actually say: "hey, look how scalable it is with the number of CPUs"!
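One possible fix, sketched under my own assumptions (this is my illustration, not marl's code): since some C library implementations guard rand()'s single global state with a mutex, keeping a PRNG per thread via `thread_local` removes the shared state and therefore the lock contention entirely.

```cpp
#include <cstdlib>
#include <random>

// Hypothetical lock-free replacement for rand(): each thread owns its own
// generator, so concurrent callers never touch shared state and never block.
inline int threadLocalRand() {
  thread_local std::mt19937 rng{std::random_device{}()};
  thread_local std::uniform_int_distribution<int> dist(0, RAND_MAX);
  return dist(rng);
}
```

Each thread pays a one-time seeding cost on first use, after which every call is a plain local computation.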

I am also puzzled by the fact that the number of CPUs used is always equal to (num_threads + 1).

@facontidavide facontidavide changed the title Fractal example: poor sclability and unexpected number of CPU Fractal example: poor scalability and unexpected number of CPUs Sep 9, 2019

ben-clayton added a commit to ben-clayton/marl that referenced this issue Sep 9, 2019
As @facontidavide points out in google#17, some standard libraries have a mutex lock in rand() that can dramatically hurt performance when run across multiple threads.

Instead do subsampling in a regular n x m pattern, removing the need to call `rand()` at all. This also has the benefit of making the output deterministic.
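The idea in the commit message can be sketched roughly as follows (function name and signature are mine, not from the actual change): place the sub-samples on a regular n x m grid inside each pixel, which needs no RNG calls at all and always produces the same offsets.

```cpp
#include <utility>
#include <vector>

// Hypothetical sketch of regular grid subsampling: returns n*m sample
// offsets in [0,1) x [0,1), one at the center of each grid cell.
// Deterministic by construction, and no shared RNG state to contend on.
std::vector<std::pair<float, float>> gridSubsamples(int n, int m) {
  std::vector<std::pair<float, float>> out;
  out.reserve(static_cast<std::size_t>(n) * m);
  for (int j = 0; j < m; j++)
    for (int i = 0; i < n; i++)
      out.emplace_back((i + 0.5f) / n, (j + 0.5f) / m);  // cell centers
  return out;
}
```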
@ben-clayton
Contributor

ben-clayton commented Sep 9, 2019

Hi @facontidavide,

Thank you for reporting this. This is an interesting problem that doesn't seem to affect the standard library found with the macOS Xcode toolchain (which is where I wrote and tested this example). This is what I see:

setWorkerThreadCount    Time taken
0                       0m14.397s
1                       0m7.483s
2                       0m5.389s
3                       0m4.235s
4                       0m3.517s
5                       0m3.173s
6                       0m2.893s

That said, I've put a PR together that completely eliminates the use of rand() in favour of regular grid subsampling. (I also see you've put together a change; I'll take a look in a sec.)

> Running with argument "1" I expected one CPU to be used, but actually 2 CPUs are used (100% each).

I expect to see 2 hardware threads created, as setWorkerThreadCount() specifies the number of dedicated worker threads to be spawned (main thread + dedicated worker = 2). Seeing the main thread churn at 100% might be due to the main thread being starved of work at wg.wait() and spinning in Scheduler::Worker::spinForWork(). This spin-down should only last 1ms though, so I'll take a look at my end and see if I can reproduce.

See explanation below.

Cheers!
Ben

@ben-clayton
Copy link
Contributor

Regardless of whether we go with #19 or #18, do both of these fix the lack-of-expected-performance-gain issue you originally described, @facontidavide?

@facontidavide
Author

facontidavide commented Sep 9, 2019

Yes, both work on my computer (Ubuntu 16.04). By "work" I mean that I can observe scalability qualitatively similar to yours.

#18 should be preferred because it is actually faster and, as you mentioned, deterministic.

@ben-clayton
Contributor

> First of all I modified the code to run in a purely sequential way (no marl). This is the time required by a single CPU:

Are you sure about those numbers?

@ben-clayton
Contributor

If I disable all marl parallelization work I get:

real    0m15.314s
user    0m15.272s
sys     0m0.028s

Which is in line with what I'd expect.

@ben-clayton
Contributor

The reason we see 2 CPUs working at 100% is that the main thread has become entirely blocked, and so it "joins the party" and starts processing tasks itself.
Note: This is the standard behaviour of task execution when there are no dedicated worker threads (setWorkerThreadCount(0)) - it will only start executing tasks on a non-dedicated thread when there is nothing else for it to do.
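To illustrate the mechanism with a toy sketch (not marl's actual implementation, all names are mine): if wait() drains any pending tasks on the calling thread instead of idling, the "blocked" caller shows up as one more fully busy CPU.

```cpp
#include <deque>
#include <functional>
#include <mutex>

// Toy model of a scheduler whose waiting thread joins in: wait() pops and
// runs queued tasks on the calling thread until the queue is empty, so the
// caller is never actually idle while work remains.
class ToyScheduler {
 public:
  void schedule(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push_back(std::move(task));
  }

  // Called by the "blocked" thread; it executes the remaining work itself.
  void wait() {
    for (;;) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop_front();
      }
      task();  // runs on the waiting thread
    }
  }

 private:
  std::mutex mutex_;
  std::deque<std::function<void()>> tasks_;
};
```

This is also why the observed CPU count is num_threads + 1: the dedicated workers plus the main thread doing work while it waits.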

ben-clayton added a commit that referenced this issue Sep 9, 2019
As @facontidavide points out in #17, some standard libraries have a mutex lock in rand() that can dramatically hurt performance when run across multiple threads.

Instead do subsampling in a regular n x m pattern, removing the need to call `rand()` at all. This also has the benefit of making the output deterministic.