
Don't use __constant__ cache for lock arrays, enable once per run update instead of once per call #1385

Closed
crtrott opened this issue Feb 2, 2018 · 7 comments
Comments

@crtrott
Member

crtrott commented Feb 2, 2018

Based on the discussion in #1375, I checked what happens if you don't use constant cache for the lock arrays. Basically we lose something like 2% in a "big atomics" benchmark (kokkos/benchmarks/atomics using ./test.cuda 100000 100000 100 1000 1 100 5) on both Pascal and Kepler. But in a small miniMD test I now get the same performance as with RDC (that test doesn't need the lock arrays). Being able to do this relies on the fact that, according to discussions with NVIDIA folks, plain device symbols have different scope semantics than device constant symbols.
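
For readers following along, here is a minimal sketch of the two approaches being compared. The symbol and function names are illustrative only, not the actual Kokkos internals:

    #include <cuda_runtime.h>

    // Old approach: the lock-array pointer lives in __constant__ memory.
    // __constant__ data is cached, but the symbol had to be refreshed via
    // cudaMemcpyToSymbol before (potentially) every kernel call.
    __device__ __constant__ int* g_lock_array_constant;

    // New approach: a plain __device__ symbol. Its value persists across
    // kernel launches, so it only needs to be set once per translation unit.
    __device__ int* g_lock_array;

    // Host-side update; with the __device__ symbol this runs once per run
    // (per translation unit) instead of once per call.
    inline void copy_lock_array_to_device(int* device_ptr) {
      cudaMemcpyToSymbol(g_lock_array, &device_ptr, sizeof(int*));
    }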

@crtrott crtrott added the Enhancement Improve existing capability; will potentially require voting label Feb 2, 2018
@crtrott crtrott self-assigned this Feb 2, 2018
@ibaned
Contributor

ibaned commented Feb 2, 2018

@crtrott let me get this straight. I think these are both true:

  • The scope of __device__ __constant__ is per-kernel-launch
  • The scope of __device__ is per translation unit

If so, some questions:

  • Doesn't this mean we still have to copy the lock arrays every time two consecutively executed kernels are in different translation units?
  • Or is it a bit better, in that we only have to do the copy if the kernel being run is from a translation unit whose kernels have not been run yet?
  • Also, doesn't this create one global variable per translation unit? Is that at all problematic in codes consisting of thousands of translation units?

@crtrott
Member Author

crtrott commented Feb 2, 2018

Yes, right on every front. And no, this is not an issue as long as we don't hit a hundred million translation units, and while I am hesitant to put anything beyond our capacity to produce stupendously complex software, I believe even we won't reach that ;-)

@crtrott
Member Author

crtrott commented Feb 2, 2018

To answer the other question: we need to update the first time we call a kernel in a given translation unit. That's what I tried to do, though thinking about it I might do it more often than that (i.e. once per kernel), but either should solve our performance issue.
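
As a rough illustration of the once-per-translation-unit update described above (hypothetical names, not the actual patch in #1386):

    #include <cuda_runtime.h>

    // The per-translation-unit __device__ lock-array symbol.
    __device__ int* g_lock_array;

    // File-static flag: becomes true once this translation unit's symbol
    // has been copied, so later kernel calls from the same unit skip the
    // cudaMemcpyToSymbol entirely.
    static bool lock_arrays_copied_in_this_tu = false;

    inline void ensure_lock_arrays_on_device(int* device_ptr) {
      if (!lock_arrays_copied_in_this_tu) {
        cudaMemcpyToSymbol(g_lock_array, &device_ptr, sizeof(int*));
        lock_arrays_copied_in_this_tu = true;
      }
    }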

@ibaned
Contributor

ibaned commented Feb 2, 2018

Okay, so basically each translation unit will have its own initialization, but it is persistent, so if we visit the same translation unit twice it won't re-copy the arrays. That sounds pretty acceptable to me. Can we get a PR for this by the February milestone? I can help if needed.

@crtrott
Member Author

crtrott commented Feb 2, 2018

The PR is already there. You just need to approve it 👍

@ibaned
Contributor

ibaned commented Feb 2, 2018

@crtrott awesome, you're fast. For bookkeeping purposes, here is the PR: #1386

@mhoemmen
Contributor

mhoemmen commented Feb 2, 2018

@rrdrake @prwolfe

crtrott added a commit that referenced this issue Feb 4, 2018
Address issue #1385 not using __constant__ for lock arrays on CUDA
@crtrott crtrott added this to the 2018 February milestone Feb 4, 2018