Join GitHub today
GitHub is home to over 36 million developers working together to host and review code, manage projects, and build software together.Sign up
runtime: support parallel cache-oblivious algorithms #9129
Cache-oblivious algorithms automatically take advantage of all levels of caches present in the system (register set, L1/L2/L2, disk) by recursively sub-dividing the problem into smaller parts: http://en.wikipedia.org/wiki/Cache-oblivious_algorithm Recursive divide-and-conquer is the working horse of lots of parallel algorithms. Here is a sample program that does simple O(N^2) computation using 4 methods: 1. sequential naive (iterative) 2. sequential cache-oblivious 3. parallel naive (iterative) 4. parallel cache-oblivious http://play.golang.org/p/P4vh-GxnRz Results are: serial sum... 3m0.149312702s (sum=480293945344) cache oblivious sum... 1m59.191344136s (sum=480293945344) parallel sum(1)... 2m45.547188918s (sum=480293945344) parallel sum(2)... 1m29.9953968s (sum=480293945344) parallel sum(4)... 46.828645173s (sum=480293945344) parallel sum(8)... 27.005117917s (sum=480293945344) parallel sum(16)... 13.942930282s (sum=480293945344) parallel sum(32)... 10.920164632s (sum=480293945344) cache oblivious parallel sum(1)... 2m32.863516062s (sum=480293945344) cache oblivious parallel sum(2)... 1m13.964254945s (sum=480293945344) cache oblivious parallel sum(4)... 43.00664982s (sum=480293945344) cache oblivious parallel sum(8)... 21.625423811s (sum=480293945344) cache oblivious parallel sum(16)... 13.566854603s (sum=480293945344) cache oblivious parallel sum(32)... 9.740402766s (sum=480293945344) Sequential cache-oblivious algorithm is 33% faster even on this small data set. However, parallel cache-oblivious version does not show this speedup. The problem is that parallel cache-oblivious algorithms require LIFO scheduling (or at least as close to LIFO as possible), but current goroutine scheduler does FIFO. As the result the algorithm does BFS instead of DFS, this not only breaks cache locality, but also increases memory consumption from N to 2^N (where N is the depth of the divide-and-conquer tree). Currently a user has to make a choice between efficient cache usage or parallelization, but not both. One way or another we need to support such algorithms.
I understand that there is a lot going on in 1.5 that merits a moratorium on large runtime changes. However, please consider leaving some room in the development tree for such changes (in 1.6, etc.). There are programs whose bottleneck is not GC, and could benefit from such changes. For example, we would like to implement such matrix-multiply algorithms in the Go BLAS implementation.