
Efficient cache-aware and cache-oblivious implementations #79

Open
smartalecH opened this issue Dec 9, 2022 · 1 comment

@smartalecH

There's a great example, diffusion2D_shmem_novis.jl, which shows how to leverage the shared memory of a GPU (thereby taking advantage of the spatial locality common to so many stencil codes).

It would be great if we could abstract this slightly such that we could implement cache-aware algorithms on the CPU using the same kernel code as the GPU. In essence, one divides the domain into chunks (or blocks, on a GPU) and then loops over these chunks, assigning workers as needed. Unfortunately, one doesn't have direct access to the L1 cache within a CPU (otherwise it wouldn't be called a cache). However, it's well known that by tiling the arrays so that each tile's working set fits into cache, the data stays resident in cache, allowing you to leverage spatial and temporal locality (as is often done with CPU-based tiled matrix multiplication).
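
For concreteness, something like the following minimal sketch (plain Julia, not ParallelStencil's `@parallel` API; the tile size `TILE`, the function name, and the variable names are made up and would need tuning to the target cache) illustrates the kind of cache-aware tiling I have in mind:

```julia
# Hypothetical sketch: cache-aware tiling of a 2D diffusion step in plain Julia.
# TILE is a hand-picked tile size; in a cache-aware scheme it would be chosen so
# that a tile's working set fits into L1/L2 cache.
const TILE = 64

function diffusion2D_tiled!(T2, T, lam, dt, _dx2, _dy2)
    nx, ny = size(T)
    for jt in 2:TILE:ny-1, it in 2:TILE:nx-1                            # loop over tiles
        for j in jt:min(jt+TILE-1, ny-1), i in it:min(it+TILE-1, nx-1)  # loop within a tile
            T2[i,j] = T[i,j] + dt*lam*((T[i-1,j] - 2T[i,j] + T[i+1,j])*_dx2 +
                                       (T[i,j-1] - 2T[i,j] + T[i,j+1])*_dy2)
        end
    end
    return nothing
end
```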

Currently, the CPU loops over individual indices... so even if someone manually chunked the domain, there would be quite a lot of cache contention. Instead, it would be better to loop over tiles in this case. (There's a bit of extra work in ensuring that the same kernel code can be used on both a CPU and a GPU, because of the extra worker hierarchy within GPUs.)
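
As a rough sketch of that tile-level loop with one worker per tile (again plain Julia with made-up names, using `Threads.@threads` so that each CPU thread plays the role of a GPU block):

```julia
# Hypothetical sketch: one CPU thread per tile. Each thread works on its own
# contiguous tile instead of interleaved indices, which reduces cache
# contention between workers.
function diffusion2D_tiled_threads!(T2, T, lam, dt, _dx2, _dy2; tile=64)
    nx, ny = size(T)
    tiles = [(it, jt) for jt in 2:tile:ny-1 for it in 2:tile:nx-1]  # tile origins
    Threads.@threads for k in eachindex(tiles)
        it, jt = tiles[k]
        for j in jt:min(jt+tile-1, ny-1), i in it:min(it+tile-1, nx-1)
            T2[i,j] = T[i,j] + dt*lam*((T[i-1,j] - 2T[i,j] + T[i+1,j])*_dx2 +
                                       (T[i,j-1] - 2T[i,j] + T[i,j+1])*_dy2)
        end
    end
    return nothing
end
```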

Even better would be an abstract interface to a cache-oblivious implementation that "just works" on CPUs and GPUs (i.e. a recursive technique could be used so that the algorithm adapts to the cache sizes automatically, without ever having to detect them). Cache-oblivious algorithms are great for a number of reasons besides portability. Namely, they tend to leverage all levels of the cache (L2, L3, etc.) better than cache-aware algorithms.
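
For the cache-oblivious variant, the usual trick is recursive subdivision of the index space; a rough sketch (plain Julia again, with made-up names, and a small cutoff only to bound recursion overhead, not tuned to any cache) could look like:

```julia
# Hypothetical sketch: cache-oblivious traversal by recursive subdivision.
# The recursion halves the larger dimension until sub-blocks are tiny; at some
# depth every sub-block fits in every cache level, without the code ever
# knowing the cache sizes. The cutoff only limits recursion overhead.
function diffusion2D_recursive!(T2, T, lam, dt, _dx2, _dy2, ir, jr; cutoff=16)
    if length(ir) <= cutoff && length(jr) <= cutoff
        for j in jr, i in ir                               # base case: small block
            T2[i,j] = T[i,j] + dt*lam*((T[i-1,j] - 2T[i,j] + T[i+1,j])*_dx2 +
                                       (T[i,j-1] - 2T[i,j] + T[i,j+1])*_dy2)
        end
    elseif length(ir) >= length(jr)                        # split the longer dimension
        m = first(ir) + length(ir) ÷ 2
        diffusion2D_recursive!(T2, T, lam, dt, _dx2, _dy2, first(ir):m-1, jr; cutoff=cutoff)
        diffusion2D_recursive!(T2, T, lam, dt, _dx2, _dy2, m:last(ir), jr; cutoff=cutoff)
    else
        m = first(jr) + length(jr) ÷ 2
        diffusion2D_recursive!(T2, T, lam, dt, _dx2, _dy2, ir, first(jr):m-1; cutoff=cutoff)
        diffusion2D_recursive!(T2, T, lam, dt, _dx2, _dy2, ir, m:last(jr); cutoff=cutoff)
    end
    return nothing
end

# Entry point over the whole interior of the domain:
# diffusion2D_recursive!(T2, T, lam, dt, _dx2, _dy2, 2:size(T,1)-1, 2:size(T,2)-1)
```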

I realize this is a somewhat grandiose vision... I'm just throwing it out there for people to toy with 🙂

@omlins
Owner

omlins commented Dec 15, 2022

Thanks for your thoughts. We will address CPU optimisations in the LV backend.
