
Adding occupancy tuning for CUDA architectures #6788

Open · wants to merge 14 commits into base: develop
Conversation

@khuck khuck commented Feb 6, 2024

The old Kokkos fork/branch from
https://github.com/DavidPoliakoff/kokkos/tree/feature/tune-occupancy
was merged with the current Kokkos develop and tested with ArborX to
confirm that autotuning occupancy for the DBSCAN benchmark works.
In tests on a system with a V100, the original benchmark iterated
600 times took 119.064 seconds to run. During the tuning process
(using simulated annealing), the runtime was 108.014 seconds.
When using cached results, the runtime was 109.058 seconds. The
converged occupancy value was 70. Here are the cached results
from APEX autotuning:

Input_1:
  name: kokkos.kernel_name
  id: 1
  info.type: string
  info.category: categorical
  info.valueQuantity: unbounded
  info.candidates: unbounded
  num_bins: 0
Input_2:
  name: kokkos.kernel_type
  id: 2
  info.type: string
  info.category: categorical
  info.valueQuantity: set
  info.candidates: [parallel_for,parallel_reduce,parallel_scan,parallel_copy]
Output_3:
  name: ArborX::Experimental::HalfTraversal
  id: 3
  info.type: int64
  info.category: ratio
  info.valueQuantity: range
  info.candidates:
    lower: 5
    upper: 100
    step: 5
    open upper: 0
    open lower: 0
Context_0:
  Name: "[2:parallel_for,1:ArborX::Experimental::HalfTraversal,tree_node:default]"
  Converged: true
  Results:
    NumVars: 1
    id: 3
    value: 70

In manual experiments, the ArborX team determined that the optimal
occupancy for this example was between 40 and 90, which gave a 10%
improvement over the baseline default of 100. See arborx/ArborX#815
for details.

One deviation from the branch that David had written: I set the occupancy
range to [5, 100] with a step size of 5, whereas the original implementation
in Kokkos used [1, 100] with a step size of 1. The change reduces the search
space and allows faster convergence.

For anything that works, all credit goes to @DavidPoliakoff
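For orientation, here is a minimal sketch of how a kernel opts into occupancy tuning. Kokkos::Experimental::prefer with a fixed DesiredOccupancy percentage is pre-existing Kokkos API; passing Kokkos::AUTO so that a tuning tool such as APEX picks the value is what this PR wires up. The exact AUTO spelling below follows the linked test repository, so treat this as an illustrative sketch rather than the authoritative interface.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1 << 20;
    Kokkos::View<double*> x("x", N);
    // Pre-existing hint: pin this kernel at 70% occupancy.
    auto fixed = Kokkos::Experimental::prefer(
        Kokkos::RangePolicy<>(0, N),
        Kokkos::Experimental::DesiredOccupancy{70});
    Kokkos::parallel_for(
        "fixed_occupancy", fixed,
        KOKKOS_LAMBDA(const int i) { x(i) = 2.0 * i; });
    // With this PR plus a tuning tool, the value can instead be left to
    // the tuner, which searches occupancies in [5, 100] with a step of 5.
    auto tuned = Kokkos::Experimental::prefer(
        Kokkos::RangePolicy<>(0, N),
        Kokkos::Experimental::DesiredOccupancy{Kokkos::AUTO});
    Kokkos::parallel_for(
        "tuned_occupancy", tuned,
        KOKKOS_LAMBDA(const int i) { x(i) = 2.0 * i; });
  }
  Kokkos::finalize();
}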

@@ -450,7 +450,7 @@ class CudaInternal {
cudaError_t cuda_func_set_attributes_wrapper(T* entry, cudaFuncAttribute attr,
int value) const {
if constexpr (setCudaDevice) set_cuda_device();
return cudaFuncSetAttributes(entry, attr, value);
Contributor

Fixed in #6786.

Author

Does that mean I should revert this fix in my PR?

Contributor

Just rebase it on top of develop.

Author

done

Comment on lines +20 to +28
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Macros.hpp>
#include <Kokkos_Tuners.hpp>
Contributor
Needs

Suggested change
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Macros.hpp>
#include <Kokkos_Tuners.hpp>
#ifndef KOKKOS_IMPL_PUBLIC_INCLUDE
#define KOKKOS_IMPL_PUBLIC_INCLUDE
#define KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#endif
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Macros.hpp>
#include <Kokkos_Tuners.hpp>

and

#ifdef KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#undef KOKKOS_IMPL_PUBLIC_INCLUDE
#undef KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#endif 

at the end of the file since we are now including non-public headers.

Author

done

@Rombur
Member

Rombur commented Mar 4, 2024

Something is strange with this PR: 87 files have been changed, and my browser crashed when I tried to look at the diff. Can you try to rebase on develop once again?

@khuck
Author

khuck commented Mar 4, 2024

@Rombur yes, I noticed the same when I rebased for #6786 as suggested in the first comment... I'll see what I can do to clean this up.

Note: this is a re-commit of a branch that was somehow polluted when I rebased on
develop. I started over with the 5 changed files.
@khuck khuck force-pushed the merge-develop-occupancy-tuning branch from 17d0af0 to 8dba118 on March 11, 2024 17:39
@khuck
Author

khuck commented Mar 11, 2024

@Rombur @masterleinad I re-committed the branch on the current develop; there shouldn't be a corrupted commit history now. Let me know if you need any other changes. Thanks!

@Rombur
Member

Rombur commented Mar 14, 2024

The two configurations that have TUNING enabled are failing. I'll try to look at the issue when I have time.

@khuck
Author

khuck commented Mar 14, 2024

The two failures I see are related to something specific to linking Fortran on Fedora with Clang; I have no idea what that's about... I was able to reproduce it with a docker image of my own, but I don't think it's related?

@masterleinad
Contributor

masterleinad commented Mar 14, 2024

> I was able to reproduce it with a docker image of my own, but I don't think it's related?

How did you reproduce? What images were you using?

@Rombur
Member

Rombur commented Mar 14, 2024

@khuck I am talking about this and this.
In the CUDA case, the relevant error is:

error: too many errors emitted, stopping now [clang-diagnostic-error]
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:622:48: error: no type named 'pointer_type' in 'Kokkos::BOr<unsigned char>' [clang-diagnostic-error]
  using pointer_type   = typename ReducerType::pointer_type;
                                               ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:368:51: note: in instantiation of template class 'Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>, Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Kokkos::Cuda>' requested here
        Impl::FunctorAnalysis<Interface, typename ClosureType::Policy,
                                                  ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:391:12: note: in instantiation of function template specialization 'Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>::internal_team_size_common<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>, Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, int (&)(const Kokkos::Impl::CudaInternal *, const cudaFuncAttributes &, const Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char> &, unsigned long, unsigned long, unsigned long)>' requested here
    return internal_team_size_common<ClosureType>(
           ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:131:12: note: in instantiation of function template specialization 'Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>::internal_team_size_max<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>, Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>>' requested here
    return internal_team_size_max<closure_type>(f);
           ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/impl/Kokkos_Profiling.hpp:373:19: note: in instantiation of function template specialization 'Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>::team_size_max<Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>' requested here
    return policy.team_size_max(functor, reducer_example, tag);
                  ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Kokkos_Tuners.hpp:479:35: note: in instantiation of function template specialization 'Kokkos::Tools::Impl::Impl::ComplexReducerSizeCalculator<Kokkos::BOr<unsigned char>>::get_max_team_size<Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::ParallelReduceTag>' requested here
        auto max_team_size = calc.get_max_team_size(policy, functor, tag);
                                  ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/impl/Kokkos_Profiling.hpp:448:24: note: in instantiation of function template specialization 'Kokkos::Tools::Experimental::TeamSizeTuner::TeamSizeTuner<Kokkos::Tools::Impl::Impl::ComplexReducerSizeCalculator<Kokkos::BOr<unsigned char>>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::ParallelReduceTag, Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>' requested here
                       Tuner(label, policy, functor, tag,
                       ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Kokkos_Parallel_Reduce.hpp:1535:5: note: in instantiation of member function 'Kokkos::Impl::ParallelReduceAdaptor<Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>::execute_impl' requested here
    execute_impl(label, policy, functor, return_value);
    ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Kokkos_Parallel_Reduce.hpp:1771:69: note: in instantiation of function template specialization 'Kokkos::Impl::ParallelReduceAdaptor<Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>::execute<Kokkos::BOr<unsigned char>>' requested here
  Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute(
                                                                    ^
/var/jenkins/workspace/Kokkos_PR-6788/core/unit_test/TestTeam.hpp:1336:13: note: in instantiation of function template specialization 'Kokkos::parallel_reduce<Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>' requested here
    Kokkos::parallel_reduce(policy_type(league_size, team_size), functor,
            ^
/var/jenkins/workspace/Kokkos_PR-6788/core/unit_test/TestTeamBasic.hpp:309:39: note: in instantiation of member function 'Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>::test_teambroadcast' requested here
                      unsigned char>::test_teambroadcast(0, 1);

@khuck
Author

khuck commented Mar 14, 2024

> How did you reproduce? What images were you using?

Oh, I was looking at the wrong failure. I was running the fedora:41 clang++ test, not the last one on the list...

@Rombur
Member

Rombur commented Apr 10, 2024

It looks like the issue is with CUDA Clang. The failures are from OPENMPTARGET-CLANG and CUDA-CLANG, which both use Clang on NVIDIA GPUs. I just tried this PR with CUDA 12.4 and Clang 17.0.6 and the PR works. I am going to try with an older compiler stack.

@Rombur
Member

Rombur commented Apr 10, 2024

I forgot to turn tuning on... Now I can reproduce the issue.

@khuck
Author

khuck commented Apr 10, 2024

@Rombur are you able to reproduce it with a newer Clang and CUDA? It has taken me a while to get Clang 12.0.0 and CUDA 11 together on the same machine, but I can now reproduce the build error...

@khuck
Author

khuck commented Apr 10, 2024

> I forgot to turn tuning on... Now I can reproduce the issue.

Do you think it is something specific to ParallelReduce?

@masterleinad
Contributor

diff --git a/core/src/Kokkos_Parallel_Reduce.hpp b/core/src/Kokkos_Parallel_Reduce.hpp
index 39cb87acd..c656f72c2 100644
--- a/core/src/Kokkos_Parallel_Reduce.hpp
+++ b/core/src/Kokkos_Parallel_Reduce.hpp
@@ -1492,12 +1492,6 @@ struct ParallelReduceAdaptor {
     using PassedReducerType = typename return_value_adapter::reducer_type;
     uint64_t kpID           = 0;
 
-    /** Request a tuned policy from the tools subsystem */
-    auto response = Kokkos::Tools::Impl::begin_parallel_reduce<
-        typename return_value_adapter::reducer_type>(policy, functor, label,
-                                                     kpID);
-    auto& inner_policy = response.policy;
-
     using ReducerSelector =
         Kokkos::Impl::if_c<std::is_same<InvalidType, PassedReducerType>::value,
                            FunctorType, PassedReducerType>;
@@ -1509,7 +1503,12 @@ struct ParallelReduceAdaptor {
         functor, typename Analysis::Reducer(
                      ReducerSelector::select(functor, return_value)));
 
-    // FIXME Remove "Wrapper" once all backends implement the new interface
+  /** Request a tuned policy from the tools subsystem */
+    auto response = Kokkos::Tools::Impl::begin_parallel_reduce<
+        typename return_value_adapter::reducer_type>(policy, functor_reducer, label,
+                                                     kpID);
+    auto& inner_policy = response.policy;
+
     Impl::ParallelReduce<decltype(functor_reducer), PolicyType,
                          typename Impl::FunctorPolicyExecutionSpace<
                              FunctorType, PolicyType>::execution_space>

should fix it.

@Rombur
Member

Rombur commented Apr 11, 2024

@khuck Yes, I can reproduce the error with CUDA 12.4 and Clang 17.0.6. @masterleinad's patch fixes one issue, but there are others.

@khuck
Author

khuck commented Apr 11, 2024

@masterleinad your fix helps, but doesn't fix everything. @Rombur actually, it isn't just Clang + CUDA; I can reproduce it with GCC 9.4.0 when building the tests...

@khuck
Author

khuck commented Apr 11, 2024

This also helps some:

diff --git a/core/src/OpenMP/Kokkos_OpenMP_Parallel_Reduce.hpp b/core/src/OpenMP/Kokkos_OpenMP_Parallel_Reduce.hpp
index 05fd1c9dc..6a52b5ec2 100644
--- a/core/src/OpenMP/Kokkos_OpenMP_Parallel_Reduce.hpp
+++ b/core/src/OpenMP/Kokkos_OpenMP_Parallel_Reduce.hpp
@@ -34,6 +34,8 @@ class ParallelReduce<CombinedFunctorReducerType, Kokkos::RangePolicy<Traits...>,
   using Policy      = Kokkos::RangePolicy<Traits...>;
   using FunctorType = typename CombinedFunctorReducerType::functor_type;
   using ReducerType = typename CombinedFunctorReducerType::reducer_type;
+  using functor_type   = FunctorType;
+  using reducer_type   = ReducerType;
 
   using WorkTag = typename Policy::work_tag;
   using Member  = typename Policy::member_type;
@@ -191,6 +193,8 @@ class ParallelReduce<CombinedFunctorReducerType,
   using Policy        = typename MDRangePolicy::impl_range_policy;
   using FunctorType   = typename CombinedFunctorReducerType::functor_type;
   using ReducerType   = typename CombinedFunctorReducerType::reducer_type;
+  using functor_type   = FunctorType;
+  using reducer_type   = ReducerType;
 
   using WorkTag = typename MDRangePolicy::work_tag;
   using Member  = typename Policy::member_type;
@@ -343,6 +347,8 @@ class ParallelReduce<CombinedFunctorReducerType,
       Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP, Properties...>;
   using FunctorType = typename CombinedFunctorReducerType::functor_type;
   using ReducerType = typename CombinedFunctorReducerType::reducer_type;
+  using functor_type   = FunctorType;
+  using reducer_type   = ReducerType;
 
   using WorkTag  = typename Policy::work_tag;
   using SchedTag = typename Policy::schedule_type::type;

@khuck
Author

khuck commented Apr 11, 2024

I think I found a clue... the remaining issues seem to be related to the fact that for MDRange reductions, a CombinedFunctorReducer isn't getting created somewhere, even though that's what is expected. Instead, Kokkos::Impl::ParallelReduce<> is getting passed a plain Functor...
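For reference, a simplified sketch of the wrapper type in play. The member names (functor_type, reducer_type, get_functor, get_reducer) match the patches quoted in this thread, but this is not the actual Kokkos definition, which lives in the implementation headers and carries more machinery.

// Simplified sketch only; not the real Kokkos::Impl::CombinedFunctorReducer.
template <class Functor, class Reducer>
class CombinedFunctorReducer {
 public:
  using functor_type = Functor;
  using reducer_type = Reducer;

  CombinedFunctorReducer(const Functor& f, const Reducer& r)
      : m_functor(f), m_reducer(r) {}

  // The tuning hooks in the patches above unpack the pair via these getters.
  const Functor& get_functor() const { return m_functor; }
  const Reducer& get_reducer() const { return m_reducer; }

 private:
  Functor m_functor;
  Reducer m_reducer;
};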

@khuck
Author

khuck commented Apr 11, 2024

OK, using @masterleinad's fix I was able to get past the problems with MDRange. Now I am having an issue with Teams and a simple reduce. When the CombinedFunctorReducer is constructed, the Reducer returned by this call https://github.com/khuck/kokkos/blob/e0dc0128e04f18c2bbbaefceef3616e7ddcfa3c4/core/src/Kokkos_Parallel_Reduce.hpp#L1506-L1508 doesn't have an operator() method:

[ 18%] Building CXX object core/unit_test/CMakeFiles/Kokkos_CoreUnitTest_Cuda1.dir/cuda/TestCuda_Other.cpp.o
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/impl/Kokkos_FunctorAnalysis.hpp(189): error: class "Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>" has no member "operator()"
          detected during:
            instantiation of class "Kokkos::Impl::FunctorAnalysis<PatternInterface, Policy, Functor, OverrideValueType>::deduce_value_type<F, Kokkos::Impl::FunctorAnalysis<PatternInterface, Policy, Functor, OverrideValueType>::REDUCE, void, true> [with PatternInterface=Kokkos::Impl::FunctorPatternInterface::REDUCE, Policy=Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda::execution_space, Kokkos::Cuda>, Functor=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>, OverrideValueType=void, F=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>]" 
(290): here
            instantiation of class "Kokkos::Impl::FunctorAnalysis<PatternInterface, Policy, Functor, OverrideValueType> [with PatternInterface=Kokkos::Impl::FunctorPatternInterface::REDUCE, Policy=Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda::execution_space, Kokkos::Cuda>, Functor=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>, OverrideValueType=void]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp(120): here
            instantiation of "int Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda, Properties...>::team_size_max(const FunctorType &, const Kokkos::ParallelReduceTag &) const [with Properties=<Kokkos::Cuda>, FunctorType=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/impl/Kokkos_Profiling.hpp(330): here
            instantiation of "int Kokkos::Tools::Impl::SimpleTeamSizeCalculator::get_max_team_size(const Policy &, const Functor &, Tag) [with Policy=Kokkos::TeamPolicy<Kokkos::Cuda>, Functor=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>, Tag=Kokkos::ParallelReduceTag]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/Kokkos_Tuners.hpp(479): here
            instantiation of "Kokkos::Tools::Experimental::TeamSizeTuner::TeamSizeTuner(const std::string &, const Kokkos::TeamPolicy<Properties...> &, const Functor &, const TagType &, ViableConfigurationCalculator) [with ViableConfigurationCalculator=Kokkos::Tools::Impl::SimpleTeamSizeCalculator, Functor=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>, TagType=Kokkos::ParallelReduceTag, Properties=<Kokkos::Cuda>]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/Kokkos_Parallel_Reduce.hpp(1510): here
            instantiation of "void Kokkos::Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute_impl(const std::string &, const PolicyType &, const FunctorType &, ReturnType &) [with PolicyType=Kokkos::TeamPolicy<Kokkos::Cuda>, FunctorType=TestCXX11::FunctorReduceTest<Kokkos::Cuda>, ReturnType=Kokkos::View<double, Kokkos::HostSpace, Kokkos::MemoryUnmanaged>]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/Kokkos_Parallel_Reduce.hpp(1535): here
            instantiation of "std::enable_if_t<<expression>, void> Kokkos::Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute(const std::string &, const PolicyType &, const FunctorType &, ReturnType &) [with PolicyType=Kokkos::TeamPolicy<Kokkos::Cuda>, FunctorType=TestCXX11::FunctorReduceTest<Kokkos::Cuda>, ReturnType=Kokkos::View<double, Kokkos::HostSpace, Kokkos::MemoryUnmanaged>, Dummy=Kokkos::View<double, Kokkos::HostSpace, Kokkos::MemoryUnmanaged>]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/Kokkos_Parallel_Reduce.hpp(1772): here
            instantiation of "std::enable_if_t<<expression>, void> Kokkos::parallel_reduce(const PolicyType &, const FunctorType &, const ReturnType &) [with PolicyType=Kokkos::TeamPolicy<Kokkos::Cuda>, FunctorType=TestCXX11::FunctorReduceTest<Kokkos::Cuda>, ReturnType=Kokkos::View<double, Kokkos::HostSpace, Kokkos::MemoryUnmanaged>]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/unit_test/TestCXX11.hpp(220): here
            instantiation of "double TestCXX11::ReduceTestFunctor<DeviceType,PWRTest>() [with DeviceType=Kokkos::Cuda, PWRTest=false]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/unit_test/TestCXX11.hpp(304): here
            instantiation of "double TestCXX11::TestVariantFunctor<DeviceType>(int) [with DeviceType=Kokkos::Cuda]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/unit_test/TestCXX11.hpp(314): here
            instantiation of "__nv_bool TestCXX11::Test<DeviceType>(int) [with DeviceType=Kokkos::Cuda]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/unit_test/TestCXX11.hpp(347): here

@masterleinad
Contributor

diff --git a/core/src/impl/Kokkos_Profiling.hpp b/core/src/impl/Kokkos_Profiling.hpp
index 04e48454b..be06a5895 100644
--- a/core/src/impl/Kokkos_Profiling.hpp
+++ b/core/src/impl/Kokkos_Profiling.hpp
@@ -329,6 +329,12 @@ struct SimpleTeamSizeCalculator {
                         const Tag tag) {
     auto max = policy.team_size_max(functor, tag);
     return max;
+  }
+    template <typename Policy, typename FunctorReducer>
+  int get_max_team_size(const Policy& policy, const FunctorReducer& functor_reducer,
+                        const Kokkos::ParallelReduceTag tag) {
+    auto max = policy.team_size_max(functor_reducer.get_functor(), functor_reducer.get_reducer(), tag);
+    return max;
   }
   template <typename Policy, typename Functor, typename Tag>
   int get_recommended_team_size(const Policy& policy, const Functor& functor,
@@ -344,13 +350,13 @@ struct SimpleTeamSizeCalculator {
     using driver     = Kokkos::Impl::ParallelFor<Functor, Policy, exec_space>;
     return driver::max_tile_size_product(policy, functor);
   }
-  template <typename Policy, typename Functor>
+  template <typename Policy, typename FunctorReducer>
   int get_mdrange_max_tile_size_product(const Policy& policy,
-                                        const Functor& functor,
+                                        const FunctorReducer& functor_reducer,
                                         const Kokkos::ParallelReduceTag&) {
     using exec_space = typename Policy::execution_space;
-    using driver = Kokkos::Impl::ParallelReduce<Functor, Policy, exec_space>;
-    return driver::max_tile_size_product(policy, functor);
+    using driver = Kokkos::Impl::ParallelReduce<FunctorReducer, Policy, exec_space>;
+    return driver::max_tile_size_product(policy, functor_reducer.get_functor());
   }
 };
 
@@ -360,31 +366,24 @@ struct SimpleTeamSizeCalculator {
 // constructible from a reference to an
 // instance of their value_type so we construct
 // a value_type and temporary reducer here
-template <typename ReducerType>
 struct ComplexReducerSizeCalculator {
-  template <typename Policy, typename Functor, typename Tag>
-  int get_max_team_size(const Policy& policy, const Functor& functor,
+  template <typename Policy, typename FunctorReducer, typename Tag>
+  int get_max_team_size(const Policy& policy, const FunctorReducer& functor_reducer,
                         const Tag tag) {
-    using value_type = typename ReducerType::value_type;
-    value_type value;
-    ReducerType reducer_example = ReducerType(value);
-    return policy.team_size_max(functor, reducer_example, tag);
+    return policy.team_size_max(functor_reducer.get_functor(), functor_reducer.get_reducer(), tag);
   }
-  template <typename Policy, typename Functor, typename Tag>
-  int get_recommended_team_size(const Policy& policy, const Functor& functor,
+  template <typename Policy, typename FunctorReducer, typename Tag>
+  int get_recommended_team_size(const Policy& policy, const FunctorReducer& functor_reducer,
                                 const Tag tag) {
-    using value_type = typename ReducerType::value_type;
-    value_type value;
-    ReducerType reducer_example = ReducerType(value);
-    return policy.team_size_recommended(functor, reducer_example, tag);
+    return policy.team_size_recommended(functor_reducer.get_functor(), functor_reducer.get_reducer(), tag);
   }
-  template <typename Policy, typename Functor>
+  template <typename Policy, typename FunctorReducer>
   int get_mdrange_max_tile_size_product(const Policy& policy,
-                                        const Functor& functor,
+                                        const FunctorReducer& functor_reducer,
                                         const Kokkos::ParallelReduceTag&) {
     using exec_space = typename Policy::execution_space;
-    using driver = Kokkos::Impl::ParallelReduce<Functor, Policy, exec_space>;
-    return driver::max_tile_size_product(policy, functor);
+    using driver = Kokkos::Impl::ParallelReduce<FunctorReducer, Policy, exec_space>;
+    return driver::max_tile_size_product(policy, functor_reducer.get_functor());
   }
 };
 
@@ -442,7 +441,7 @@ auto generic_tune_policy(const std::string& label_in, Map& map,
         return (map.emplace(
                        label,
                        Tuner(label, policy, functor, tag,
-                             Impl::ComplexReducerSizeCalculator<ReducerType>{}))
+                             Impl::ComplexReducerSizeCalculator{}))
                     .first);
       }
       return my_tuner;

should do the trick.

@khuck
Author

khuck commented Apr 12, 2024

@masterleinad yup, that did the trick, thanks! I'm rerunning as many tests as I can think of before I push the change...

Contributor

@masterleinad masterleinad left a comment

Looks pretty good to me. Any chance that we can test this?

};
} // namespace Impl
template <class Bound>
class SingleDimensionalRangeTuner {
Contributor

What else do you envision this class to be used for?

Comment on lines 606 to 611
RangePolicyOccupancyTuner& operator=(const RangePolicyOccupancyTuner& other) =
default;
RangePolicyOccupancyTuner(const RangePolicyOccupancyTuner& other) = default;
RangePolicyOccupancyTuner& operator=(RangePolicyOccupancyTuner&& other) =
default;
RangePolicyOccupancyTuner(RangePolicyOccupancyTuner&& other) = default;
Contributor

Suggested change
RangePolicyOccupancyTuner& operator=(const RangePolicyOccupancyTuner& other) =
default;
RangePolicyOccupancyTuner(const RangePolicyOccupancyTuner& other) = default;
RangePolicyOccupancyTuner& operator=(RangePolicyOccupancyTuner&& other) =
default;
RangePolicyOccupancyTuner(RangePolicyOccupancyTuner&& other) = default;

Author

This change breaks compilation...

Contributor

Interesting. How is it failing?

Author

/usr/include/c++/8/tuple(1668): error: no instance of constructor "Kokkos::Tools::Experimental::RangePolicyOccupancyTuner::RangePolicyOccupancyTuner" matches the argument list
          detected during:
            instantiation of "std::pair<_T1, _T2>::pair(std::tuple<_Args1...> &, std::tuple<_Args2...> &, std::_Index_tuple<_Indexes1...>, std::_Index_tuple<_Indexes2...>) [with _T1=const std::__cxx11::string, _T2=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, _Args1=<const std::__cxx11::string &>, _Indexes1=<0UL>, _Args2=<>, _Indexes2=<>]" 
(1658): here
            instantiation of "std::pair<_T1, _T2>::pair(std::piecewise_construct_t, std::tuple<_Args1...>, std::tuple<_Args2...>) [with _T1=const std::__cxx11::string, _T2=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, _Args1=<const std::__cxx11::string &>, _Args2=<>]" 
/usr/include/c++/8/ext/new_allocator.h(136): here
            instantiation of "void __gnu_cxx::new_allocator<_Tp>::construct(_Up *, _Args &&...) [with _Tp=std::_Rb_tree_node<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Up=std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>, _Args=<const std::piecewise_construct_t &, std::tuple<const std::__cxx11::string &>, std::tuple<>>]" 
/usr/include/c++/8/bits/alloc_traits.h(475): here
            instantiation of "void std::allocator_traits<std::allocator<_Tp>>::construct(std::allocator_traits<std::allocator<_Tp>>::allocator_type &, _Up *, _Args &&...) [with _Tp=std::_Rb_tree_node<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Up=std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>, _Args=<const std::piecewise_construct_t &, std::tuple<const std::__cxx11::string &>, std::tuple<>>]" 
/usr/include/c++/8/bits/stl_tree.h(628): here
            instantiation of "void std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_M_construct_node(std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_Link_type, _Args &&...) [with _Key=std::__cxx11::string, _Val=std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>, _KeyOfValue=std::_Select1st<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Compare=std::less<std::__cxx11::string>, _Alloc=std::allocator<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Args=<const std::piecewise_construct_t &, std::tuple<const std::__cxx11::string &>, std::tuple<>>]" 
/usr/include/c++/8/bits/stl_tree.h(643): here
            [ 2 instantiation contexts not shown ]
            instantiation of "std::map<_Key, _Tp, _Compare, _Alloc>::mapped_type &std::map<_Key, _Tp, _Compare, _Alloc>::operator[](const std::map<_Key, _Tp, _Compare, _Alloc>::key_type &) [with _Key=std::__cxx11::string, _Tp=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, _Compare=std::less<std::__cxx11::string>, _Alloc=std::allocator<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>]" 
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_Profiling.hpp(638): here
            instantiation of "void Kokkos::Tools::Impl::generic_report_results<Tuner,Functor,TagType,TuningPermissionFunctor,Map,Policy>(const std::__cxx11::string &, Map &, const Policy &, const Functor &, const TagType &, const TuningPermissionFunctor &) [with Tuner=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, Functor=lambda [](ptrdiff_t)->void, TagType=Kokkos::ParallelForTag, TuningPermissionFunctor=lambda [](const Policy &)->__nv_bool, Map=std::map<std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, std::less<std::__cxx11::string>, std::allocator<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>>, Policy=Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>]" 
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_Profiling.hpp(688): here
            instantiation of "void Kokkos::Tools::Impl::report_policy_results(size_t, const std::__cxx11::string &, const Kokkos::RangePolicy<Properties...> &, const Functor &, const TagType &) [with Functor=lambda [](ptrdiff_t)->void, TagType=Kokkos::ParallelForTag, Properties=<Kokkos::DefaultHostExecutionSpace>]" 
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_Profiling.hpp(726): here
            instantiation of "void Kokkos::Tools::Impl::end_parallel_for(const ExecPolicy &, FunctorType &, const std::__cxx11::string &, uint64_t &) [with ExecPolicy=Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>, FunctorType=const lambda [](ptrdiff_t)->void]" 
/home/users/khuck/src/occupancy/kokkos/core/src/Kokkos_Parallel.hpp(148): here
            instantiation of "void Kokkos::parallel_for(const std::__cxx11::string &, const ExecPolicy &, const FunctorType &) [with ExecPolicy=Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>, FunctorType=lambda [](ptrdiff_t)->void, Enable=void]" 
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_HostSpace_deepcopy.cpp(81): here

1 error detected in the compilation of "/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_HostSpace_deepcopy.cpp".
gmake[2]: *** [kokkos/core/src/CMakeFiles/kokkoscore.dir/build.make:188: kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_HostSpace_deepcopy.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:1254: kokkos/core/src/CMakeFiles/kokkoscore.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2

Contributor

I can see similar errors when I remove the default constructor but not when removing the defaulted special member functions.

Author

Oops, you're right, I deleted one too many lines...

Comment on lines +366 to +395
// when we have a complex reducer, we need to pass an
// instance to team_size_recommended/max. Reducers
// aren't default constructible, but they are
// constructible from a reference to an
// instance of their value_type so we construct
// a value_type and temporary reducer here
struct ComplexReducerSizeCalculator {
template <typename Policy, typename FunctorReducer, typename Tag>
int get_max_team_size(const Policy& policy,
const FunctorReducer& functor_reducer, const Tag tag) {
return policy.team_size_max(functor_reducer.get_functor(),
functor_reducer.get_reducer(), tag);
}
template <typename Policy, typename FunctorReducer, typename Tag>
int get_recommended_team_size(const Policy& policy,
const FunctorReducer& functor_reducer,
const Tag tag) {
return policy.team_size_recommended(functor_reducer.get_functor(),
functor_reducer.get_reducer(), tag);
}
template <typename Policy, typename FunctorReducer>
int get_mdrange_max_tile_size_product(const Policy& policy,
const FunctorReducer& functor_reducer,
const Kokkos::ParallelReduceTag&) {
using exec_space = typename Policy::execution_space;
using driver =
Kokkos::Impl::ParallelReduce<FunctorReducer, Policy, exec_space>;
return driver::max_tile_size_product(policy, functor_reducer.get_functor());
}
};
Contributor

We should be able to just use SimpleTeamSizeCalculator now that ParallelReduce always takes a FunctorReducer argument.
We could do this cleanup in a follow-up pull request, of course.

core/src/impl/Kokkos_Profiling.hpp (outdated; resolved)
@khuck
Author

khuck commented Apr 15, 2024

> Looks pretty good to me. Any chance that we can test this?

I will see what I can do to make sure I test the MDRange, Team, and Range policies for the occupancy tuning... I know I have tested the Range policy with a big ArborX example, but it would be good to have some small tests for the Kokkos build, if possible. I am not sure how that would work without including APEX, though...

@khuck
Author

khuck commented Apr 19, 2024

@masterleinad @Rombur Here's a GitHub repo that includes six test programs and will build Kokkos and APEX to test the tuning. I know it's not a Kokkos unit test, but this will at least test everything together. It's possible this could be integrated into the kokkos-tools build, which also builds APEX: https://github.com/khuck/apex-kokkos-tuning
In particular, https://github.com/khuck/apex-kokkos-tuning/blob/aa7415e07d74fe76f4d5274c9c835ff517352aa9/tests/CMakeLists.txt#L20-L22 shows the apex_exec command line used to run the tuning. The next step after this one is to integrate the RangePolicy tuning from another branch of David's forked repository.

@masterleinad
Contributor

> @masterleinad @Rombur Here's a GitHub repo that includes six test programs and will build Kokkos and APEX to test the tuning. I know it's not a Kokkos unit test, but this will at least test everything together. It's possible this could be integrated into the kokkos-tools build, which also builds APEX: khuck/apex-kokkos-tuning

That might be a good solution.

@Rombur
Member

Rombur commented Apr 19, 2024

@khuck thanks for the tests. We can put them in Kokkos-Tools or in our nightly.
Looking at the tests, it seems they are just running but there are no checks. I understand that the tuning will vary with the hardware, but do you think we can check that the values are in a certain range? What worries me is that one day we break something and we only try 100% occupancy. With the current tests, we wouldn't realize that there is an issue, right?

@khuck
Author

khuck commented Apr 19, 2024

@Rombur good point, I should have added some validation. Yes, there will be some stochastic behavior in the results, but I can at least check for convergence in the output text or in the output convergence result file. Let me add that...

@khuck
Author

khuck commented Apr 22, 2024

> @Rombur good point, I should have added some validation. Yes, there will be some stochastic behavior in the results, but I can at least check for convergence in the output text or in the output convergence result file. Let me add that...

OK! Tests are updated and expanded. I now test all the benchmarks with all the different tuning options: no tuning, exhaustive, random, genetic_search, and simulated_annealing. Each test then uses the cached/converged results (if converged) to run. Then there is a fake test to clean up the converged results for the next test. I tested on a generic CUDA test machine with cuda/11.7 and on Perlmutter with the default modules (CUDA 12.2 and GCC 12.3). I did find one problem on Frontier, though: when I run the test test_mdrange_gemm_occupancy_no_tuning, I get a segmentation fault in HIP. Interestingly, it's only that test, and only when not using APEX or autotuning; just running build/tests/mdrange_gemm_occupancy causes the crash. I assume it has something to do with this line (requesting AUTO occupancy, which I did not think was implemented in HIP?...): https://github.com/khuck/apex-kokkos-tuning/blob/718f20e4c0f7e21db768c25caddd0c6ee0dd15ca/tests/mdrange_gemm_occupancy.cpp#L65-L66

@Rombur
Member

Rombur commented Apr 23, 2024

Hmm, I have no idea why it segfaults. Can you get a backtrace?

@khuck
Author

khuck commented Apr 23, 2024

> Hmm, I have no idea why it segfaults. Can you get a backtrace?

OK, it's fixed. The problem was with the benchmark; I guess I don't totally understand how some of this boilerplate that David Poliakoff wrote works. Apparently, if the size of the View is smaller than the number of loop iterations, indexes larger than the View extent are passed in to the parallel_for loop. The problem was in the test, not Kokkos.
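For illustration, a generic sketch of that failure mode (assumed shapes; this is not the actual tuning_playground code):

// The policy iterates N times, but the View only holds M < N elements,
// so indices in [M, N) index out of bounds.
const int M = 100;
const int N = 1000;
Kokkos::View<double*> v("v", M);
Kokkos::parallel_for(
    "oob_sketch", Kokkos::RangePolicy<>(0, N),
    KOKKOS_LAMBDA(const int i) {
      v(i) = 1.0;  // invalid access for i >= M
    });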

@masterleinad
Contributor

> Apparently, if the size of the View is smaller than the number of loop iterations, indexes larger than the View extent are passed in to the parallel_for loop.

That sounds weird. The tuning should not modify the number of work items, right?

@khuck
Author

khuck commented Apr 23, 2024

> That sounds weird. The tuning should not modify the number of work items, right?

Correct. The problem isn't with the tuning in Kokkos or in APEX, but with the tuning_playground.hpp file. I am still looking into it...

@khuck
Author

khuck commented Apr 23, 2024

> That sounds weird. The tuning should not modify the number of work items, right?

That's correct, and it does not. The problem is in the test program; I think I have it fixed...

@khuck
Author

khuck commented Apr 23, 2024

@Rombur @masterleinad OK! All tests are passing and working correctly. I tested on Frontier, Perlmutter, and my V100 system. I pushed the updates to the apex-kokkos-tuning repo; you should be able to build and run that, and if not, let me know.
By the way, the "idk_just_mm_no_tuning" test fails on Frontier due to a 180-second timeout (I did confirm that it will run to completion if you let it). The bad team-based gemm (https://github.com/khuck/apex-kokkos-tuning/blob/737da07df67840a07bb04cfa6923a3ac59cb376c/tests/idk_jmm.cpp#L64-L83) is really slow on MI250X. It's not fast on CUDA either, but at least it completes in 80-110 seconds (depending on V100 or A100).

Member

@Rombur Rombur left a comment

The code looks fine. Once this is merged, I'll add the tests to our nightly.

Kokkos::ParallelForTag{});
}
#else
(void)policy;
Member

You could use [[maybe_unused]] instead of having the #else branch.
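For illustration, a minimal sketch of the two idioms; the function names here are placeholders rather than the actual PR code:

// Idiom used in the PR: an #else branch "uses" the parameter via a cast.
template <typename Policy>
void report_results_v1(const Policy& policy) {
#ifdef KOKKOS_ENABLE_TUNING
  do_report(policy);  // hypothetical call
#else
  (void)policy;  // silence unused-parameter warnings
#endif
}

// Suggested alternative: [[maybe_unused]] removes the need for #else.
template <typename Policy>
void report_results_v2([[maybe_unused]] const Policy& policy) {
#ifdef KOKKOS_ENABLE_TUNING
  do_report(policy);  // hypothetical call
#endif
}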

@khuck
Author

khuck commented Apr 26, 2024

@Rombur @masterleinad unfortunately, I seem to have found a logic bug in how tuning contexts are incremented. There is a simple fix, but I need to do a bit more testing with it...

Minor fix to prevent incrementing the context id index when not calling `context_begin()`. Ideally, this should be refactored so that `begin_context()` increments the id and returns it, and `end_context()` is the only location that decrements the context id index.
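A minimal sketch of the suggested refactoring, assuming the context id is a plain counter (the real implementation in the Kokkos tools subsystem differs):

#include <cstddef>

// Hypothetical global counter; placeholder for the real tools state.
static std::size_t g_context_id = 0;

// begin_context() owns the increment and hands the id to the caller.
std::size_t begin_context() { return ++g_context_id; }

// end_context() is the only place the counter decreases.
void end_context(std::size_t /* id of the context being closed */) {
  --g_context_id;
}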
@khuck
Author

khuck commented Apr 26, 2024

> @Rombur @masterleinad unfortunately, I seem to have found a logic bug in how tuning contexts are incremented. There is a simple fix, but I need to do a bit more testing with it...

Simple fix committed!

@khuck
Author

khuck commented May 7, 2024

Is there anything else that needs to be done with this PR so it can be merged?

@Rombur
Member

Rombur commented May 7, 2024

@crtrott @dalg24 I think this is ready to be merged.
