runtime: support parallel cache-oblivious algorithms #9129

Open
dvyukov opened this issue Nov 19, 2014 · 5 comments

@dvyukov
Member

dvyukov commented Nov 19, 2014

Cache-oblivious algorithms automatically take advantage of all levels of caches present
in the system (register set, L1/L2/L3, disk) by recursively sub-dividing the problem
into smaller parts:
http://en.wikipedia.org/wiki/Cache-oblivious_algorithm

Recursive divide-and-conquer is the workhorse of many parallel algorithms.

Here is a sample program that does a simple O(N^2) computation using 4 methods:
1. sequential naive (iterative)
2. sequential cache-oblivious
3. parallel naive (iterative)
4. parallel cache-oblivious

http://play.golang.org/p/P4vh-GxnRz
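
For readers who cannot open the playground link, here is a minimal sketch of what methods 1, 2 and 4 might look like. It is not the linked program: the pairwise-product workload, the `leaf` threshold, the `depth` spawn limit and all function names are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// leaf is a hypothetical base-case size; below it we fall back to plain loops.
const leaf = 64

// pairSumNaive iterates over every (a[i], b[j]) pair: the O(N^2) baseline.
func pairSumNaive(a, b []uint64) uint64 {
	var s uint64
	for _, x := range a {
		for _, y := range b {
			s += x * y
		}
	}
	return s
}

// pairSumCO halves the longer input recursively, so the working set
// eventually fits into each cache level without knowing the cache sizes.
func pairSumCO(a, b []uint64) uint64 {
	if len(a) <= leaf && len(b) <= leaf {
		return pairSumNaive(a, b)
	}
	if len(a) >= len(b) {
		m := len(a) / 2
		return pairSumCO(a[:m], b) + pairSumCO(a[m:], b)
	}
	m := len(b) / 2
	return pairSumCO(a, b[:m]) + pairSumCO(a, b[m:])
}

// pairSumCOPar runs one half in a new goroutine for the first depth levels,
// then continues sequentially with pairSumCO.
func pairSumCOPar(a, b []uint64, depth int) uint64 {
	if depth == 0 || (len(a) <= leaf && len(b) <= leaf) {
		return pairSumCO(a, b)
	}
	a1, a2, b1, b2 := a, a, b, b
	if len(a) >= len(b) {
		a1, a2 = a[:len(a)/2], a[len(a)/2:]
	} else {
		b1, b2 = b[:len(b)/2], b[len(b)/2:]
	}
	var left uint64
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		left = pairSumCOPar(a1, b1, depth-1)
	}()
	right := pairSumCOPar(a2, b2, depth-1)
	wg.Wait()
	return left + right
}

func main() {
	a := make([]uint64, 1000)
	b := make([]uint64, 700)
	for i := range a {
		a[i] = uint64(i)
	}
	for i := range b {
		b[i] = uint64(2 * i)
	}
	fmt.Println(pairSumNaive(a, b), pairSumCO(a, b), pairSumCOPar(a, b, 4))
}
```

All three functions return the same sum; only the traversal order, and therefore the memory access pattern, differs.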

Results are:

serial sum... 3m0.149312702s (sum=480293945344)
cache oblivious sum... 1m59.191344136s (sum=480293945344)
parallel sum(1)... 2m45.547188918s (sum=480293945344)
parallel sum(2)... 1m29.9953968s (sum=480293945344)
parallel sum(4)... 46.828645173s (sum=480293945344)
parallel sum(8)... 27.005117917s (sum=480293945344)
parallel sum(16)... 13.942930282s (sum=480293945344)
parallel sum(32)... 10.920164632s (sum=480293945344)
cache oblivious parallel sum(1)... 2m32.863516062s (sum=480293945344)
cache oblivious parallel sum(2)... 1m13.964254945s (sum=480293945344)
cache oblivious parallel sum(4)... 43.00664982s (sum=480293945344)
cache oblivious parallel sum(8)... 21.625423811s (sum=480293945344)
cache oblivious parallel sum(16)... 13.566854603s (sum=480293945344)
cache oblivious parallel sum(32)... 9.740402766s (sum=480293945344)

The sequential cache-oblivious algorithm is 33% faster even on this small data set. However,
the parallel cache-oblivious version does not show this speedup. The problem is that
parallel cache-oblivious algorithms require LIFO scheduling (or at least as close to
LIFO as possible), but the current goroutine scheduler does FIFO. As a result, the
algorithm does BFS instead of DFS. This not only breaks cache locality, but also
increases memory consumption from N to 2^N (where N is the depth of the
divide-and-conquer tree).
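
The N versus 2^N effect can be reproduced with a toy model. The sketch below is not the goroutine scheduler: it assumes a single worker draining one run queue, with every non-leaf task enqueueing two children, and it only counts how many tasks are pending at the worst moment. `peakPending` is a hypothetical helper name.

```go
package main

import "fmt"

// peakPending simulates one worker draining a run queue of tasks from a
// binary divide-and-conquer tree of the given depth. Each queue entry is
// the depth of a task; a task above the leaf level enqueues two children.
// It returns the largest queue length observed.
func peakPending(depth int, lifo bool) int {
	queue := []int{0}
	peak := 1
	for len(queue) > 0 {
		var d int
		if lifo {
			// LIFO: run the most recently spawned task (depth-first).
			d = queue[len(queue)-1]
			queue = queue[:len(queue)-1]
		} else {
			// FIFO: run the oldest task (breadth-first).
			d = queue[0]
			queue = queue[1:]
		}
		if d < depth {
			queue = append(queue, d+1, d+1)
		}
		if len(queue) > peak {
			peak = len(queue)
		}
	}
	return peak
}

func main() {
	fmt.Println("FIFO peak:", peakPending(10, false)) // 1024, i.e. 2^10
	fmt.Println("LIFO peak:", peakPending(10, true))  // 11, i.e. depth+1
}
```

In this model FIFO order expands a whole level of the tree before touching the next one, so every leaf is pending at once, while LIFO order keeps only one unfinished sibling per level.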

Currently a user has to choose between efficient cache usage and parallelization,
and cannot have both. One way or another we need to support such algorithms.
@RLH
Contributor

RLH commented Nov 19, 2014

Comment 1:

At the end of the day, didn't this work end up with a Cilk-style work-stealing
scheduler that spawned the parent, executed the child, and stole
the oldest parent? Is this what you are proposing?

@dvyukov
Member Author

dvyukov commented Nov 19, 2014

Comment 2:

Yes, I am proposing to use Cilk-style scheduling for such computations (but preserving
some degree of fairness).
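
To make the Cilk-style discipline mentioned above concrete, here is a minimal sketch of the per-worker queue it relies on: the owner pushes and pops at the bottom (LIFO), while a thief takes the oldest entry from the top. This is an illustration only, not the Go runtime's data structure; the `deque` type, its method names and the mutex-based locking are all assumptions, and the fairness concern raised in the comment is not addressed.

```go
package main

import (
	"fmt"
	"sync"
)

// deque is a hypothetical per-worker task queue. A real work-stealing
// scheduler would typically use a lock-free structure; a mutex keeps the
// sketch short.
type deque struct {
	mu    sync.Mutex
	tasks []string
}

// pushBottom adds a newly spawned task at the owner's end.
func (d *deque) pushBottom(t string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.tasks = append(d.tasks, t)
}

// popBottom returns the newest task, so the owner works depth-first.
func (d *deque) popBottom() (string, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.tasks) == 0 {
		return "", false
	}
	t := d.tasks[len(d.tasks)-1]
	d.tasks = d.tasks[:len(d.tasks)-1]
	return t, true
}

// stealTop returns the oldest task, which usually represents the largest
// remaining subtree.
func (d *deque) stealTop() (string, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.tasks) == 0 {
		return "", false
	}
	t := d.tasks[0]
	d.tasks = d.tasks[1:]
	return t, true
}

func main() {
	var d deque
	for _, t := range []string{"root", "left", "left.left"} {
		d.pushBottom(t)
	}
	stolen, _ := d.stealTop()
	local, _ := d.popBottom()
	fmt.Println("thief took:", stolen) // root
	fmt.Println("owner took:", local)  // left.left
}
```

Stealing from the opposite end keeps the owner's cache-hot, most recent work local while handing thieves coarse-grained work, which limits how often stealing is needed.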

@rsc
Contributor

rsc commented Nov 19, 2014

Comment 3:

It's fine to file bugs for this kind of thing so we remember it, but this is not even
low priority right now. It's zero priority. We have much more important things to fix.
I am going to push back very strongly on structural runtime changes during 1.5 other
than the concurrent GC.

@btracey
Contributor

btracey commented Dec 1, 2014

Comment 4:

I understand that there is a lot going on in 1.5 that merits a moratorium on large
runtime changes. However, please consider leaving some room in the development tree for
such changes (in 1.6, etc.). There are programs whose bottleneck is not GC and that could
benefit from such changes. For example, we would like to implement such matrix-multiply
algorithms in the Go BLAS implementation.

@egonelbre
Contributor

egonelbre commented Jun 21, 2020

Updated numbers using Go 1.15beta1 windows/amd64:

init... 24.9877ms
serial sum... 5m1.493115s (sum=480293945344)
cache oblivious sum... 38.726983s (sum=480293945344)
parallel sum(1)... 4m56.0129857s (sum=480293945344)
parallel sum(2)... 2m27.3990265s (sum=480293945344)
parallel sum(4)... 1m58.1337872s (sum=480293945344)
parallel sum(8)... 1m0.278899s (sum=480293945344)
parallel sum(16)... 37.8720665s (sum=480293945344)
parallel sum(32)... 22.8789963s (sum=480293945344)
cache oblivious parallel sum(1)... 56.2580064s (sum=480293945344)
cache oblivious parallel sum(2)... 27.855001s (sum=480293945344)
cache oblivious parallel sum(4)... 16.6889919s (sum=480293945344)
cache oblivious parallel sum(8)... 9.2350009s (sum=480293945344)
cache oblivious parallel sum(16)... 7.322997s (sum=480293945344)
cache oblivious parallel sum(32)... 6.1710005s (sum=480293945344)

The situation seems to have improved significantly.
