cudaFuncSetCacheConfig is setting CachePreferShared too often #2066

Closed

ambrad opened this issue Apr 2, 2019 · 5 comments
Labels
Bug: Broken / incorrect code; it could be Kokkos' responsibility, or others' (e.g., Trilinos)

Comments
ambrad commented Apr 2, 2019

Is there a way for a Kokkos client to call cudaFuncSetCacheConfig? Right now, it appears that CudaParallelLaunch always calls it on new GPUs, and it also appears that it always passes cudaFuncCachePreferShared for TeamPolicy, even if the user does not request shared memory. I have found in some kernels that if I modify that call in the Kokkos source code to prefer L1, the kernels run substantially faster.

If there is not a way, I suggest TeamPolicy take something like Kokkos::CachePreferShared or Kokkos::CachePreferL1 as an option.
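For concreteness, here is a minimal sketch of what such an option could look like at the call site. Kokkos::CachePreferL1 and set_cache_preference are hypothetical names for the suggested feature, not existing Kokkos API; today the cudaFuncSetCacheConfig call happens inside CudaParallelLaunch with no user control.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    policy_t policy(128, Kokkos::AUTO);
    // Hypothetical knob suggested above; on CUDA it would map to
    // cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1).
    // policy.set_cache_preference(Kokkos::CachePreferL1);
    Kokkos::parallel_for(
        "no_shmem_team_kernel", policy,
        KOKKOS_LAMBDA(const policy_t::member_type& team) {
          // team-level work that requests no team scratch/shared memory
          (void)team;
        });
  }
  Kokkos::finalize();
  return 0;
}
```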

dhollman added the Feature Request (Create new capability; will potentially require voting) label on Apr 3, 2019
dhollman commented Apr 3, 2019

"Prefer shared" or "prefer L1" are not really descriptive, declarative traits in line with the kinds of information we get from users in the Kokkos programming model. What does this mean on anything other than a GPU? What if the structure and performance characteristics of L1 is different on future (or past) GPUs? (Actually, that's already the case with some past GPU architectures we support.) Kokkos is a declarative performance portability layer, and so we need declarative traits that will map to different things on different hardware.

We're definitely not opposed to providing an execution policy property that maps to this behavior on current GPUs, which will give you the performance you want. The point is that it's not as simple as giving the properties the names you suggested. Can you come up with a descriptive property of your kernel that generalizes why your kernel needs to prefer shared memory or prefer L1? (I realize this is a much harder question to answer, but in the long run it's these sorts of hard questions that have made Kokkos so portable to date.)

vbrunini (Contributor) commented Apr 3, 2019

Does this need to be exposed to the user as an additional trait? The TeamPolicy already knows how much shared memory a kernel has requested per-team or per-thread. Couldn't the Cuda backend just look at that and choose whether to set prefer shared or prefer cache based on the amount of shared memory usage?
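A rough sketch of that idea, assuming the launcher can see the per-team shared memory requested through the policy. The names here are illustrative; this is not the actual CudaParallelLaunch code.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative heuristic: pick the cache configuration from the amount of
// per-block shared memory the launch will actually request.
template <class KernelFunc>
void choose_cache_config(KernelFunc* kernel, std::size_t requested_shmem_bytes) {
  const cudaFuncCache pref = (requested_shmem_bytes == 0)
                                 ? cudaFuncCachePreferL1
                                 : cudaFuncCachePreferShared;
  cudaFuncSetCacheConfig(reinterpret_cast<const void*>(kernel), pref);
}
```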

ambrad (Author) commented Apr 3, 2019

Thanks for the comments.

@vbrunini, yes, exactly. I didn't word my issue clearly enough; I'm pretty sure this is a legitimate bug. I believe at one time Kokkos was doing, or trying to do, exactly as you write. Then at some point Kokkos started allocating a little bit of shared memory for TeamPolicy (edit: for its own use, i.e., exclusively for the Kokkos backend). But that little allocation (in my speculative reconstruction of events) tripped up the original logic, making TeamPolicy always choose PreferShared and never L1, which was not originally intended.

The fix, IMO, should simply be: if the user (not the backend) asks for 0 shared memory, then definitely PreferL1. For user allocations >0, there is some nuance, but the common use case is 0.
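In other words, something like the following decision, where the key point is that only the user-requested scratch counts, not the few bytes the backend reserves for itself. This is a sketch with illustrative names, not the actual Kokkos code.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// user_shmem_bytes: team scratch requested by the user via the policy/functor.
// The backend's small internal allocation deliberately does not influence
// the preference.
inline cudaFuncCache choose_preference(std::size_t user_shmem_bytes) {
  return (user_shmem_bytes == 0) ? cudaFuncCachePreferL1
                                 : cudaFuncCachePreferShared;
}
```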

Edit for context: P100 has separate shared and L1 memories, so the preference does, AFAIK, exactly nothing on P100. But V100 unifies shared memory and L1, and it shows clear, highly repeatable performance differences depending on the setting.

ambrad (Author) commented Apr 6, 2019

The attached patch gives a good speedup on V100 for two applications, including one that has TeamPolicy-based kernels both requesting and not requesting shmem; the patch does the right thing in both cases. It's meant to preserve CudaParallelLaunch's current behavior unless an optional new argument is provided to it. It's conservative in the sense that it does not change parallel_reduce or parallel_scan, only parallel_for.

kokkos-L1-vs-shared-patch.txt
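To give a sense of the shape of the change (an illustrative sketch under assumed names, not the attached patch itself): the launch path gains an optional preference argument that defaults to today's behavior, so parallel_reduce and parallel_scan are untouched, while the parallel_for path can pass PreferL1 when the user requested no shared memory.

```cpp
#include <cuda_runtime.h>

enum class LaunchCachePref { Default, PreferL1, PreferShared };

template <class DriverType>
__global__ void launch_kernel(DriverType driver) {
  driver();  // DriverType::operator() must be a __device__ callable
}

// Optional 'pref' defaults to Default, preserving current behavior for all
// existing call sites; only callers that opt in change the cache config.
template <class DriverType>
void cuda_parallel_launch(const DriverType& driver, dim3 grid, dim3 block,
                          int shmem,
                          LaunchCachePref pref = LaunchCachePref::Default) {
  if (pref != LaunchCachePref::Default) {
    cudaFuncSetCacheConfig(
        reinterpret_cast<const void*>(&launch_kernel<DriverType>),
        pref == LaunchCachePref::PreferL1 ? cudaFuncCachePreferL1
                                          : cudaFuncCachePreferShared);
  }
  launch_kernel<DriverType><<<grid, block, shmem>>>(driver);
}
```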

crtrott added the Blocks Promotion (Overview issue for release-blocking bugs) and Bug labels and removed the Feature Request label on Apr 7, 2019
crtrott (Member) commented Apr 7, 2019

I designate this a performance bug. I know roughly when this went astray; it's just that performance testing on P100 didn't uncover it :-(.

crtrott changed the title from "Caller control of cudaFuncSetCacheConfig" to "cudaFuncSetCacheConfig is setting CachePreferShared too often" on May 13, 2019
ndellingwood removed the Blocks Promotion label on Jun 24, 2019