
Rework CUDA cache config using carveout calculations #5706

Merged: 2 commits into kokkos:develop on Jan 3, 2023

Conversation

@crtrott (Member) commented Dec 19, 2022:

This change generally relies on the CUDA runtime to figure out the best cache configuration, without us setting it manually anymore. We only set it manually if and when someone requests restricted occupancy. However, I did test the carveout scheme with the manual setting always on: for LAMMPS it essentially replicated the performance obtained by not setting anything, and it beats what we had before, which was more limited. I also wrote a test code to check that the occupancy limitation works more or less as expected. Note that this test is fairly iffy, since there are many moving parts.

#include <Kokkos_Core.hpp>
#include <cstdio>   // printf
#include <cstdlib>  // atoi

void check_range(int N, int occupancy, int work, int size) {
  // count tracks how many iterations are active at the same time; the maximum
  // id observed approximates the concurrency achieved under the requested
  // occupancy.
  Kokkos::View<int> count("C");
  Kokkos::View<double**> workarray("A", N, size);

  auto lambda = KOKKOS_LAMBDA(int i, int& lmax) {
    int id = Kokkos::atomic_fetch_add(&count(), 1);
    // Busy work so each iteration stays resident long enough to be counted.
    for (int j = 1; j < work; j++)
      workarray(i, j % size) += workarray(i, (j - 1) % size) + j;
    Kokkos::atomic_add(&count(), -1);
    if (id > lmax) lmax = id;
  };

  Kokkos::RangePolicy<> p(0,N);
  auto occ_p = Kokkos::Experimental::prefer(p, Kokkos::Experimental::DesiredOccupancy{occupancy});
  int max_val;
  Kokkos::parallel_reduce(occ_p, lambda, Kokkos::Max<int>(max_val));
  printf("Occ: %i Max: %i\n",occupancy, max_val);
}

void check_team(int N, int occupancy, int work, int size, int team_size, int shmem) {
  Kokkos::View<int> count("C");
  Kokkos::View<double**> workarray("A",N,size);

  auto lambda = KOKKOS_LAMBDA(const typename Kokkos::TeamPolicy<>::member_type& team, int& lmax) {
    int i = team.league_rank()*team_size+team.team_rank();
    int id = Kokkos::atomic_fetch_add(&count(), 1);
    for(int j=1; j<work; j++) workarray(i,j%size) += workarray(i, (j-1)%size) + j;
    Kokkos::atomic_add(&count(), -1);
    if(id > lmax) lmax = id;
  };

  Kokkos::TeamPolicy<> p(N/team_size, team_size);
  p.set_scratch_size(0, Kokkos::PerTeam(shmem));
  auto occ_p = Kokkos::Experimental::prefer(p, Kokkos::Experimental::DesiredOccupancy{occupancy});
  int max_val;
  Kokkos::parallel_reduce(occ_p, lambda, Kokkos::Max<int>(max_val));
  printf("Occ: %i Max: %i\n",occupancy, max_val);
}
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    int N = argc > 1 ? atoi(argv[1]) : 1000000;
    int work = argc > 2 ? atoi(argv[2]) : 10000;
    int size = argc > 3 ? atoi(argv[3]) : 100;
    int occupancy = argc > 4 ? atoi(argv[4]) : 100;
    int team_size = argc > 5 ? atoi(argv[5]) : 128;
    int shmem = argc > 6 ? atoi(argv[6]) : 0;
    check_range(N,occupancy,work,size);
    check_team(N, occupancy, work, size, team_size, shmem);
  }
  Kokkos::finalize();
}

We may consider adding this test; however, it requires that the GPU be used exclusively.

@PhilMiller added the "Blocks Promotion" label (overview issue for release-blocking bugs) on Dec 20, 2022.
@PhilMiller (Contributor) commented:

Tagged as blocking in lieu of #4295

@crtrott force-pushed the cuda-cache-config3 branch 2 times, most recently from 0dbd2fa to b387701 on December 21, 2022 23:17.
@dalg24 (Member) commented Dec 22, 2022:

Make sure you add a description.

@PhilMiller added the "CHANGELOG" label (item to be included in the release CHANGELOG) on Dec 24, 2022.
@crtrott (Member, Author) commented Jan 3, 2023:

Retest this please.

inline void configure_shmem_preference(const KernelFuncPtr& func,
                                       const cudaDeviceProp& device_props,
                                       const size_t block_size, int& shmem,
                                       const size_t occupancy) {
A reviewer (Member) commented:

I wish you did not interleave out and in parameters...

Comment on lines +174 to +175:

    size_t num_blocks_desired =
        (num_threads_desired + block_size * 0.8) / block_size;
A reviewer (Member) commented:

Avoid "magic" numbers. Give it a name.

@crtrott (Member, Author) replied:

null_point_eight?

@crtrott merged commit b3793a7 into kokkos:develop on Jan 3, 2023.
@crtrott deleted the cuda-cache-config3 branch on January 3, 2023 23:55.
@crtrott mentioned this pull request on Feb 21, 2023.
Labels: Backend - CUDA, Blocks Promotion, CHANGELOG
4 participants