cudaFuncSetCacheConfig is setting CachePreferShared too often #2066

Closed

ambrad opened this issue Apr 2, 2019 · 5 comments
Labels
Bug: Broken / incorrect code; it could be Kokkos' responsibility, or others' (e.g., Trilinos)

Comments
ambrad commented Apr 2, 2019

Is there a way for a Kokkos client to call cudaFuncSetCacheConfig? Right now, it appears that CudaParallelLaunch always calls it on new GPUs, and it also appears that it always passes cudaFuncCachePreferShared for TeamPolicy, even if the user does not request shared memory. I have found in some kernels that if I modify that call in the Kokkos source code to prefer L1, the kernels run substantially faster.

If there is not a way, I suggest TeamPolicy take something like Kokkos::CachePreferShared or Kokkos::CachePreferL1 as an option.
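For concreteness, here is a minimal sketch of what such an option could look like at the call site. Kokkos::CachePreferL1 and set_cache_preference are hypothetical names for the suggested feature, not existing Kokkos API; today the cudaFuncSetCacheConfig call happens inside CudaParallelLaunch with no user control.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<>;
    policy_t policy(128, Kokkos::AUTO);
    // Hypothetical knob suggested above; on CUDA it would map to
    // cudaFuncSetCacheConfig(kernel, cudaFuncCachePreferL1).
    // policy.set_cache_preference(Kokkos::CachePreferL1);
    Kokkos::parallel_for(
        "no_shmem_team_kernel", policy,
        KOKKOS_LAMBDA(const policy_t::member_type& team) {
          // team-level work that requests no team scratch/shared memory
          (void)team;
        });
  }
  Kokkos::finalize();
  return 0;
}
```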

dhollman added the Feature Request (Create new capability; will potentially require voting) label on Apr 3, 2019
dhollman commented Apr 3, 2019

"Prefer shared" or "prefer L1" are not really descriptive, declarative traits in line with the kinds of information we get from users in the Kokkos programming model. What does this mean on anything other than a GPU? What if the structure and performance characteristics of L1 is different on future (or past) GPUs? (Actually, that's already the case with some past GPU architectures we support.) Kokkos is a declarative performance portability layer, and so we need declarative traits that will map to different things on different hardware.

We're definitely not opposed to providing an execution policy property that maps to this behavior on current GPUs, which will give you the performance you want. The point is that it's not as simple as giving the properties the names you suggested. Can you come up with a descriptive property of your kernel that generalizes why your kernel needs to prefer shared memory or prefer L1? (I realize this is a much harder question to answer, but in the long run it's these sorts of hard questions that have made Kokkos so portable to date.)

vbrunini (Contributor) commented Apr 3, 2019

Does this need to be exposed to the user as an additional trait? The TeamPolicy already knows how much shared memory a kernel has requested per-team or per-thread. Couldn't the Cuda backend just look at that and choose whether to set prefer shared or prefer cache based on the amount of shared memory usage?
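A rough sketch of that idea, assuming the launcher can see the per-team shared memory requested through the policy. The names here are illustrative; this is not the actual CudaParallelLaunch code.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative heuristic: pick the cache configuration from the amount of
// per-block shared memory the launch will actually request.
template <class KernelFunc>
void choose_cache_config(KernelFunc* kernel, std::size_t requested_shmem_bytes) {
  const cudaFuncCache pref = (requested_shmem_bytes == 0)
                                 ? cudaFuncCachePreferL1
                                 : cudaFuncCachePreferShared;
  cudaFuncSetCacheConfig(reinterpret_cast<const void*>(kernel), pref);
}
```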

ambrad (Author) commented Apr 3, 2019

Thanks for the comments.

@vbrunini, yes, exactly. I didn't word my issue clearly enough; I'm pretty sure this is a legitimate bug. I believe at one time Kokkos was doing, or trying to do, exactly as you write. Then at some point Kokkos started allocating a little bit of shared memory for TeamPolicy (edit: for its own use, i.e., exclusively for the Kokkos backend). But that little allocation (in my speculative reconstruction of events) tripped up the original logic, making TeamPolicy always choose PreferShared and never L1, which was not originally intended.

The fix, IMO, should simply be: if the user (not the backend) asks for 0 shared memory, then definitely PreferL1. For user allocations >0, there is some nuance, but the common use case is 0.
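In other words, something like the following decision, where the key point is that only the user-requested scratch counts, not the few bytes the backend reserves for itself. This is a sketch with illustrative names, not the actual Kokkos code.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// user_shmem_bytes: team scratch requested by the user via the policy/functor.
// The backend's small internal allocation deliberately does not influence
// the preference.
inline cudaFuncCache choose_preference(std::size_t user_shmem_bytes) {
  return (user_shmem_bytes == 0) ? cudaFuncCachePreferL1
                                 : cudaFuncCachePreferShared;
}
```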

Edit for context: P100 has separate shared and L1 memories, so the preference does, AFAIK, exactly nothing on P100. But V100 unifies shared memory and L1, and it shows clear, highly repeatable performance differences depending on the setting.

ambrad (Author) commented Apr 6, 2019

The attached patch gives a good speedup on V100 for two applications, including one that has TeamPolicy-based kernels both requesting and not requesting shmem; the patch does the right thing in both cases. It's meant to preserve CudaParallelLaunch's current behavior unless an optional new argument is provided to it. It's conservative in the sense that it does not change parallel_reduce or parallel_scan, only parallel_for.

kokkos-L1-vs-shared-patch.txt
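To give a sense of the shape of the change (an illustrative sketch under assumed names, not the attached patch itself): the launch path gains an optional preference argument that defaults to today's behavior, so parallel_reduce and parallel_scan are untouched, while the parallel_for path can pass PreferL1 when the user requested no shared memory.

```cpp
#include <cuda_runtime.h>

enum class LaunchCachePref { Default, PreferL1, PreferShared };

template <class DriverType>
__global__ void launch_kernel(DriverType driver) {
  driver();  // DriverType::operator() must be a __device__ callable
}

// Optional 'pref' defaults to Default, preserving current behavior for all
// existing call sites; only callers that opt in change the cache config.
template <class DriverType>
void cuda_parallel_launch(const DriverType& driver, dim3 grid, dim3 block,
                          int shmem,
                          LaunchCachePref pref = LaunchCachePref::Default) {
  if (pref != LaunchCachePref::Default) {
    cudaFuncSetCacheConfig(
        reinterpret_cast<const void*>(&launch_kernel<DriverType>),
        pref == LaunchCachePref::PreferL1 ? cudaFuncCachePreferL1
                                          : cudaFuncCachePreferShared);
  }
  launch_kernel<DriverType><<<grid, block, shmem>>>(driver);
}
```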

crtrott added the Blocks Promotion (Overview issue for release-blocking bugs) and Bug labels and removed the Feature Request label on Apr 7, 2019
crtrott (Member) commented Apr 7, 2019

I designate this a performance bug. I know roughly when this went astray; it's just that performance testing on P100 didn't uncover it :-(.

crtrott changed the title from "Caller control of cudaFuncSetCacheConfig" to "cudaFuncSetCacheConfig is setting CachePreferShared too often" on May 13, 2019
ndellingwood removed the Blocks Promotion label on Jun 24, 2019