There's a great example, diffusion2D_shmem_novis.jl, which shows how to leverage the shared memory of a GPU (thereby taking advantage of the spatial locality common to so many stencil codes).
It would be great if we could abstract this slightly so that we could implement cache-aware algorithms on the CPU using the same kernel code as on the GPU. In essence, one divides the domain into chunks (or blocks, on a GPU) and then loops over these chunks, assigning workers as needed. Unfortunately, one doesn't have direct access to the L1 cache within a CPU (otherwise it wouldn't be called a cache). However, it is well known that if the arrays are tiled such that each tile fits in cache, the hardware will keep that data resident, letting you exploit spatial locality (as is often done in CPU-based tiled matrix multiplication).
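To make the idea concrete, here is a minimal sketch of a tiled 2D diffusion step on the CPU. This is not an existing API; the function name, the stencil, and the tile size are all illustrative, with the tile size chosen so that a tile's working set stays cache-resident:

```julia
# Hypothetical sketch: a tiled 2D diffusion step. `tile` is a tuning
# parameter chosen so each tile's working set fits in L1 cache.
function diffusion_step_tiled!(Tnew, T, lam, dt, dx, dy; tile=32)
    nx, ny = size(T)
    for jt in 2:tile:ny-1, it in 2:tile:nx-1          # loop over tiles
        for j in jt:min(jt + tile - 1, ny - 1), i in it:min(it + tile - 1, nx - 1)
            # 5-point Laplacian; each tile's data stays hot in cache
            Tnew[i, j] = T[i, j] + dt * lam *
                ((T[i-1, j] - 2T[i, j] + T[i+1, j]) / dx^2 +
                 (T[i, j-1] - 2T[i, j] + T[i, j+1]) / dy^2)
        end
    end
    return Tnew
end
```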
Currently, the CPU loops over individual indices... so even if someone manually chunked the domain, there would be quite a lot of cache contention. Instead, it would be better to loop over tiles in this case, as sketched below. (There's a bit of extra work in ensuring that the same kernel code can be used on a CPU and a GPU, because of the extra worker hierarchy within GPUs.)
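Here's what looping over tiles rather than indices might look like, with CPU threads standing in for GPU blocks. Again a hypothetical sketch: `step_over_tiles!` and `kernel!` are placeholders, not an existing interface:

```julia
using Base.Threads

# Hypothetical sketch: assign CPU threads to whole tiles, mirroring how a GPU
# assigns thread blocks to chunks of the domain. Each thread owns entire
# tiles, so threads don't contend for the same cache lines.
function step_over_tiles!(Tnew, T, kernel!; tile=32)
    nx, ny = size(T)
    tiles = vec([(it, jt) for it in 2:tile:nx-1, jt in 2:tile:ny-1])
    @threads for k in eachindex(tiles)
        it, jt = tiles[k]
        irange = it:min(it + tile - 1, nx - 1)
        jrange = jt:min(jt + tile - 1, ny - 1)
        kernel!(Tnew, T, irange, jrange)   # same kernel body as the GPU version
    end
    return Tnew
end
```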
Even better would be an abstract interface to a cache-oblivious implementation that "just works" on CPUs and GPUs (e.g., via a recursive subdivision technique that performs well for any cache size without needing to detect it). Cache-oblivious algorithms are attractive for a number of reasons besides portability: notably, they tend to exploit all levels of the cache hierarchy (L2, L3, etc.) better than cache-aware algorithms.
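For illustration, the recursive subdivision at the heart of a cache-oblivious scheme might look like the sketch below. Once more hypothetical: `recurse_tiles!`, `kernel!`, and the base-case size are placeholders:

```julia
# Hypothetical sketch of cache-oblivious recursion: halve the domain along
# its longer axis until tiles reach a small base case. No cache size appears
# anywhere, yet subtiles at every scale fit some level of the hierarchy
# (L1, L2, L3, ...).
function recurse_tiles!(Tnew, T, kernel!, irange, jrange; base=16)
    if length(irange) <= base && length(jrange) <= base
        kernel!(Tnew, T, irange, jrange)             # base case: apply stencil
    elseif length(irange) >= length(jrange)
        m = first(irange) + length(irange) ÷ 2 - 1   # split along i
        recurse_tiles!(Tnew, T, kernel!, first(irange):m, jrange; base)
        recurse_tiles!(Tnew, T, kernel!, m+1:last(irange), jrange; base)
    else
        m = first(jrange) + length(jrange) ÷ 2 - 1   # split along j
        recurse_tiles!(Tnew, T, kernel!, irange, first(jrange):m; base)
        recurse_tiles!(Tnew, T, kernel!, irange, m+1:last(jrange); base)
    end
    return Tnew
end
```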
I realize this is a somewhat grandiose vision... I'm just throwing it out there for people to toy with 🙂