
Don't use __constant__ cache for lock arrays, enable once per run update instead of once per call #1385

Closed
crtrott opened this issue Feb 2, 2018 · 7 comments
Comments

@crtrott
Member

crtrott commented Feb 2, 2018

Based on the discussion in #1375, I checked what happens if you don't use constant cache for the lock arrays. Basically we lose something like 2% in a "big atomics" benchmark (kokkos/benchmarks/atomics using ./test.cuda 100000 100000 100 1000 1 100 5) on both Pascal and Kepler. But in a small miniMD test I now get the same performance as with RDC (that test doesn't need the lock arrays). Being able to do this relies on the fact that, according to discussions with NVIDIA folks, plain device symbols have different scope semantics than device constant symbols.
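
For readers following along, here is a minimal sketch of the two approaches being compared. The symbol and function names are illustrative only, not the actual Kokkos internals:

    #include <cuda_runtime.h>

    // Old approach: the lock-array pointer lives in __constant__ memory.
    // __constant__ data is cached, but the symbol had to be refreshed via
    // cudaMemcpyToSymbol before (potentially) every kernel call.
    __device__ __constant__ int* g_lock_array_constant;

    // New approach: a plain __device__ symbol. Its value persists across
    // kernel launches, so it only needs to be set once per translation unit.
    __device__ int* g_lock_array;

    // Host-side update; with the __device__ symbol this runs once per run
    // (per translation unit) instead of once per call.
    inline void copy_lock_array_to_device(int* device_ptr) {
      cudaMemcpyToSymbol(g_lock_array, &device_ptr, sizeof(int*));
    }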

@crtrott crtrott added the Enhancement Improve existing capability; will potentially require voting label Feb 2, 2018
@crtrott crtrott self-assigned this Feb 2, 2018
@ibaned
Contributor

ibaned commented Feb 2, 2018

@crtrott let me get this straight. I think these are both true:

  • The scope of __device__ __constant__ is per-kernel-launch
  • The scope of __device__ is per translation unit

If so, some questions:

  • Doesn't this mean we still have to copy the lock arrays every time two consecutively executed kernels are in different translation units?
  • Or is it a bit better, in that we only have to do the copy if the kernel being run is from a translation unit whose kernels have not been run yet?
  • Also, doesn't this create one global variable per translation unit? Is that at all problematic in codes consisting of thousands of translation units?

@crtrott
Member Author

crtrott commented Feb 2, 2018

Yes, right on every front. And no, this is not an issue as long as we don't hit a hundred million translation units, and while I am hesitant to put anything beyond our capacity to produce stupendously complex software, I believe even we won't reach that ;-)

@crtrott
Member Author

crtrott commented Feb 2, 2018

To answer the other question: we need to update the first time we call a kernel in a given translation unit. That's what I tried to do, though thinking about it I might do it more often than that (i.e. once per kernel), but either should solve our performance issue.
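
As a rough illustration of the once-per-translation-unit update described above (hypothetical names, not the actual patch in #1386):

    #include <cuda_runtime.h>

    // The per-translation-unit __device__ lock-array symbol.
    __device__ int* g_lock_array;

    // File-static flag: becomes true once this translation unit's symbol
    // has been copied, so later kernel calls from the same unit skip the
    // cudaMemcpyToSymbol entirely.
    static bool lock_arrays_copied_in_this_tu = false;

    inline void ensure_lock_arrays_on_device(int* device_ptr) {
      if (!lock_arrays_copied_in_this_tu) {
        cudaMemcpyToSymbol(g_lock_array, &device_ptr, sizeof(int*));
        lock_arrays_copied_in_this_tu = true;
      }
    }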

@ibaned
Contributor

ibaned commented Feb 2, 2018

Okay, so basically each translation unit will have its own initialization, but it is persistent, so if we visit the same translation unit twice it won't re-copy the arrays. That sounds pretty acceptable to me. Can we get a PR for this by the February milestone? I can help if needed.

@crtrott
Member Author

crtrott commented Feb 2, 2018

The PR is already there. You just need to approve it 👍

@ibaned
Contributor

ibaned commented Feb 2, 2018

@crtrott awesome, you're fast. For bookkeeping purposes, here is the PR: #1386

@mhoemmen
Contributor

mhoemmen commented Feb 2, 2018

@rrdrake @prwolfe

crtrott added a commit that referenced this issue Feb 4, 2018
Address issue #1385 not using __constant__ for lock arrays on CUDA
@crtrott crtrott added this to the 2018 February milestone Feb 4, 2018