
Rework CUDA cache config using carveout calculations #5706

Merged: 2 commits into kokkos:develop on Jan 3, 2023

Conversation

@crtrott (Member) commented Dec 19, 2022:

This change generally relies on the CUDA runtime to figure out the best cache configuration, without us setting it manually anymore. We only set it manually if and when someone requests restricted occupancy. However, I did test the carveout scheme with the manual setting always on: for LAMMPS it essentially replicated the performance obtained by not setting anything, and it beats what we had before, which was more limited. I also wrote a test code to check that the occupancy limitation works more or less as expected. Note that this test is fairly iffy, since there are many moving parts.

#include <Kokkos_Core.hpp>
#include <cstdio>   // printf
#include <cstdlib>  // atoi

void check_range(int N, int occupancy, int work, int size) {
  // count tracks how many iterations are active at the same time; the maximum
  // id observed approximates the concurrency achieved under the requested
  // occupancy.
  Kokkos::View<int> count("C");
  Kokkos::View<double**> workarray("A", N, size);

  auto lambda = KOKKOS_LAMBDA(int i, int& lmax) {
    int id = Kokkos::atomic_fetch_add(&count(), 1);
    // Busy work so each iteration stays resident long enough to be counted.
    for (int j = 1; j < work; j++)
      workarray(i, j % size) += workarray(i, (j - 1) % size) + j;
    Kokkos::atomic_add(&count(), -1);
    if (id > lmax) lmax = id;
  };

  Kokkos::RangePolicy<> p(0,N);
  auto occ_p = Kokkos::Experimental::prefer(p, Kokkos::Experimental::DesiredOccupancy{occupancy});
  int max_val;
  Kokkos::parallel_reduce(occ_p, lambda, Kokkos::Max<int>(max_val));
  printf("Occ: %i Max: %i\n",occupancy, max_val);
}

void check_team(int N, int occupancy, int work, int size, int team_size, int shmem) {
  Kokkos::View<int> count("C");
  Kokkos::View<double**> workarray("A",N,size);

  auto lambda = KOKKOS_LAMBDA(const typename Kokkos::TeamPolicy<>::member_type& team, int& lmax) {
    int i = team.league_rank()*team_size+team.team_rank();
    int id = Kokkos::atomic_fetch_add(&count(), 1);
    for(int j=1; j<work; j++) workarray(i,j%size) += workarray(i, (j-1)%size) + j;
    Kokkos::atomic_add(&count(), -1);
    if(id > lmax) lmax = id;
  };

  Kokkos::TeamPolicy<> p(N/team_size, team_size);
  p.set_scratch_size(0, Kokkos::PerTeam(shmem));
  auto occ_p = Kokkos::Experimental::prefer(p, Kokkos::Experimental::DesiredOccupancy{occupancy});
  int max_val;
  Kokkos::parallel_reduce(occ_p, lambda, Kokkos::Max<int>(max_val));
  printf("Occ: %i Max: %i\n",occupancy, max_val);
}
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    int N = argc > 1 ? atoi(argv[1]) : 1000000;
    int work = argc > 2 ? atoi(argv[2]) : 10000;
    int size = argc > 3 ? atoi(argv[3]) : 100;
    int occupancy = argc > 4 ? atoi(argv[4]) : 100;
    int team_size = argc > 5 ? atoi(argv[5]) : 128;
    int shmem = argc > 6 ? atoi(argv[6]) : 0;
    check_range(N,occupancy,work,size);
    check_team(N, occupancy, work, size, team_size, shmem);
  }
  Kokkos::finalize();
}

We may consider adding this test; however, it requires that the GPU be used exclusively.

@PhilMiller added the "Blocks Promotion" label (overview issue for release-blocking bugs) on Dec 20, 2022.
@PhilMiller (Contributor) commented:

Tagged as blocking in lieu of #4295

@crtrott force-pushed the cuda-cache-config3 branch 2 times, most recently from 0dbd2fa to b387701 on December 21, 2022 23:17.
@dalg24 (Member) commented Dec 22, 2022:

Make sure you add a description.

@PhilMiller added the "CHANGELOG" label (item to be included in the release CHANGELOG) on Dec 24, 2022.
@crtrott (Member, Author) commented Jan 3, 2023:

Retest this please.

inline void configure_shmem_preference(const KernelFuncPtr& func,
                                       const cudaDeviceProp& device_props,
                                       const size_t block_size, int& shmem,
                                       const size_t occupancy) {
A reviewer (Member) commented:

I wish you did not interleave out and in parameters...

Comment on lines +174 to +175:

    size_t num_blocks_desired =
        (num_threads_desired + block_size * 0.8) / block_size;
A reviewer (Member) commented:

Avoid "magic" numbers. Give it a name.

@crtrott (Member, Author) replied:

null_point_eight?

@crtrott merged commit b3793a7 into kokkos:develop on Jan 3, 2023.
@crtrott deleted the cuda-cache-config3 branch on January 3, 2023 23:55.
@crtrott mentioned this pull request on Feb 21, 2023.
Labels: Backend - CUDA, Blocks Promotion, CHANGELOG
4 participants