
Adding occupancy tuning for CUDA architectures #6788

Open · wants to merge 14 commits into base: develop
Conversation

@khuck khuck commented Feb 6, 2024

The old Kokkos fork/branch from
https://github.com/DavidPoliakoff/kokkos/tree/feature/tune-occupancy
was merged with the current Kokkos develop and tested with ArborX to
confirm that autotuning occupancy for the DBSCAN benchmark works.
In tests on a system with a V100, the original benchmark iterated
600 times took 119.064 seconds to run. During the tuning process
(using simulated annealing), the runtime was 108.014 seconds.
When using cached results, the runtime was 109.058 seconds. The
converged occupancy value was 70. Here are the cached results
from APEX autotuning:

Input_1:
  name: kokkos.kernel_name
  id: 1
  info.type: string
  info.category: categorical
  info.valueQuantity: unbounded
  info.candidates: unbounded
  num_bins: 0
Input_2:
  name: kokkos.kernel_type
  id: 2
  info.type: string
  info.category: categorical
  info.valueQuantity: set
  info.candidates: [parallel_for,parallel_reduce,parallel_scan,parallel_copy]
Output_3:
  name: ArborX::Experimental::HalfTraversal
  id: 3
  info.type: int64
  info.category: ratio
  info.valueQuantity: range
  info.candidates:
    lower: 5
    upper: 100
    step: 5
    open upper: 0
    open lower: 0
Context_0:
  Name: "[2:parallel_for,1:ArborX::Experimental::HalfTraversal,tree_node:default]"
  Converged: true
  Results:
    NumVars: 1
    id: 3
    value: 70

In manual experiments, the ArborX team determined that the optimal
occupancy for this example was between 40 and 90, which gave a 10%
improvement over the baseline default of 100. See arborx/ArborX#815
for details.

One deviation from the branch that David had written: I set the occupancy
range to [5, 100] with a step size of 5, whereas the original implementation
in Kokkos used [1, 100] with a step size of 1. The change reduces the search
space and allows faster convergence.

For anything that works, all credit goes to @DavidPoliakoff
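For orientation, here is a minimal sketch of how a kernel opts into occupancy tuning. Kokkos::Experimental::prefer with a fixed DesiredOccupancy percentage is pre-existing Kokkos API; passing Kokkos::AUTO so that a tuning tool such as APEX picks the value is what this PR wires up. The exact AUTO spelling below follows the linked test repository, so treat this as an illustrative sketch rather than the authoritative interface.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1 << 20;
    Kokkos::View<double*> x("x", N);
    // Pre-existing hint: pin this kernel at 70% occupancy.
    auto fixed = Kokkos::Experimental::prefer(
        Kokkos::RangePolicy<>(0, N),
        Kokkos::Experimental::DesiredOccupancy{70});
    Kokkos::parallel_for(
        "fixed_occupancy", fixed,
        KOKKOS_LAMBDA(const int i) { x(i) = 2.0 * i; });
    // With this PR plus a tuning tool, the value can instead be left to
    // the tuner, which searches occupancies in [5, 100] with a step of 5.
    auto tuned = Kokkos::Experimental::prefer(
        Kokkos::RangePolicy<>(0, N),
        Kokkos::Experimental::DesiredOccupancy{Kokkos::AUTO});
    Kokkos::parallel_for(
        "tuned_occupancy", tuned,
        KOKKOS_LAMBDA(const int i) { x(i) = 2.0 * i; });
  }
  Kokkos::finalize();
}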

@@ -450,7 +450,7 @@ class CudaInternal {
cudaError_t cuda_func_set_attributes_wrapper(T* entry, cudaFuncAttribute attr,
int value) const {
if constexpr (setCudaDevice) set_cuda_device();
return cudaFuncSetAttributes(entry, attr, value);
Contributor

Fixed in #6786.

Author

Does that mean I should revert this fix in my PR?

Contributor

Just rebase it on top of develop.

Author

done

Comment on lines +20 to +28
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Macros.hpp>
#include <Kokkos_Tuners.hpp>
Contributor
Needs

Suggested change
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Macros.hpp>
#include <Kokkos_Tuners.hpp>
#ifndef KOKKOS_IMPL_PUBLIC_INCLUDE
#define KOKKOS_IMPL_PUBLIC_INCLUDE
#define KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#endif
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Macros.hpp>
#include <Kokkos_Tuners.hpp>

and

#ifdef KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#undef KOKKOS_IMPL_PUBLIC_INCLUDE
#undef KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#endif 

at the end of the file since we are now including non-public headers.

Author

done

@Rombur
Member

Rombur commented Mar 4, 2024

Something is strange with this PR: 87 files have been changed, and my browser crashed when I tried to look at the diff. Can you try to rebase on develop once again?

@khuck
Author

khuck commented Mar 4, 2024

@Rombur yes, I noticed the same when I rebased for #6786 as suggested in the first comment... I'll see what I can do to clean this up.

Note: this is a re-commit of a branch that was somehow polluted when I rebased on
develop. I started over with the 5 changed files.
@khuck khuck force-pushed the merge-develop-occupancy-tuning branch from 17d0af0 to 8dba118 on March 11, 2024 17:39
@khuck
Author

khuck commented Mar 11, 2024

@Rombur @masterleinad I re-committed the branch on the current develop; there shouldn't be a corrupted commit history now. Let me know if you need any other changes. Thanks!

@Rombur
Member

Rombur commented Mar 14, 2024

The two configurations that have TUNING enabled are failing. I'll try to look at the issue when I have time.

@khuck
Author

khuck commented Mar 14, 2024

The two failures I see are related to something specific to linking Fortran on Fedora with Clang; I have no idea what that's about... I was able to reproduce it with a docker image of my own, but I don't think it's related?

@masterleinad
Contributor

masterleinad commented Mar 14, 2024

> I was able to reproduce it with a docker image of my own, but I don't think it's related?

How did you reproduce? What images were you using?

@Rombur
Member

Rombur commented Mar 14, 2024

@khuck I am talking about this and this.
In the CUDA case, the relevant error is:

error: too many errors emitted, stopping now [clang-diagnostic-error]
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:622:48: error: no type named 'pointer_type' in 'Kokkos::BOr<unsigned char>' [clang-diagnostic-error]
  using pointer_type   = typename ReducerType::pointer_type;
                                               ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:368:51: note: in instantiation of template class 'Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>, Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Kokkos::Cuda>' requested here
        Impl::FunctorAnalysis<Interface, typename ClosureType::Policy,
                                                  ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:391:12: note: in instantiation of function template specialization 'Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>::internal_team_size_common<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>, Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, int (&)(const Kokkos::Impl::CudaInternal *, const cudaFuncAttributes &, const Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char> &, unsigned long, unsigned long, unsigned long)>' requested here
    return internal_team_size_common<ClosureType>(
           ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp:131:12: note: in instantiation of function template specialization 'Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>::internal_team_size_max<Kokkos::Impl::ParallelReduce<Kokkos::Impl::CombinedFunctorReducer<Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>, Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>>' requested here
    return internal_team_size_max<closure_type>(f);
           ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/impl/Kokkos_Profiling.hpp:373:19: note: in instantiation of function template specialization 'Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>::team_size_max<Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>' requested here
    return policy.team_size_max(functor, reducer_example, tag);
                  ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Kokkos_Tuners.hpp:479:35: note: in instantiation of function template specialization 'Kokkos::Tools::Impl::Impl::ComplexReducerSizeCalculator<Kokkos::BOr<unsigned char>>::get_max_team_size<Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::ParallelReduceTag>' requested here
        auto max_team_size = calc.get_max_team_size(policy, functor, tag);
                                  ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/impl/Kokkos_Profiling.hpp:448:24: note: in instantiation of function template specialization 'Kokkos::Tools::Experimental::TeamSizeTuner::TeamSizeTuner<Kokkos::Tools::Impl::Impl::ComplexReducerSizeCalculator<Kokkos::BOr<unsigned char>>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::ParallelReduceTag, Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>' requested here
                       Tuner(label, policy, functor, tag,
                       ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Kokkos_Parallel_Reduce.hpp:1535:5: note: in instantiation of member function 'Kokkos::Impl::ParallelReduceAdaptor<Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>::execute_impl' requested here
    execute_impl(label, policy, functor, return_value);
    ^
/var/jenkins/workspace/Kokkos_PR-6788/core/src/Kokkos_Parallel_Reduce.hpp:1771:69: note: in instantiation of function template specialization 'Kokkos::Impl::ParallelReduceAdaptor<Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>::execute<Kokkos::BOr<unsigned char>>' requested here
  Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute(
                                                                    ^
/var/jenkins/workspace/Kokkos_PR-6788/core/unit_test/TestTeam.hpp:1336:13: note: in instantiation of function template specialization 'Kokkos::parallel_reduce<Kokkos::TeamPolicy<Kokkos::Schedule<Kokkos::Static>, Kokkos::Cuda>, Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>, Kokkos::BOr<unsigned char>>' requested here
    Kokkos::parallel_reduce(policy_type(league_size, team_size), functor,
            ^
/var/jenkins/workspace/Kokkos_PR-6788/core/unit_test/TestTeamBasic.hpp:309:39: note: in instantiation of member function 'Test::(anonymous namespace)::TestTeamBroadcast<Kokkos::Cuda, Kokkos::Schedule<Kokkos::Static>, unsigned char>::test_teambroadcast' requested here
                      unsigned char>::test_teambroadcast(0, 1);

@khuck
Author

khuck commented Mar 14, 2024

> How did you reproduce? What images were you using?

Oh, I was looking at the wrong failure. I was running the fedora:41 clang++ test, not the last one on the list...

@Rombur
Member

Rombur commented Apr 10, 2024

It looks like the issue is with CUDA Clang. The failures are from OPENMPTARGET-CLANG and CUDA-CLANG, which both use Clang on NVIDIA GPUs. I just tried this PR with CUDA 12.4 and Clang 17.0.6 and the PR works. I am going to try with an older compiler stack.

@Rombur
Member

Rombur commented Apr 10, 2024

I forgot to turn tuning on... Now I can reproduce the issue.

@khuck
Author

khuck commented Apr 10, 2024

@Rombur are you able to reproduce it with a newer Clang and CUDA? It has taken me a while to get Clang 12.0.0 and CUDA 11 together on the same machine, but I can now reproduce the build error...

@khuck
Author

khuck commented Apr 10, 2024

> I forgot to turn tuning on... Now I can reproduce the issue.

Do you think it is something specific to ParallelReduce?

@masterleinad
Contributor

diff --git a/core/src/Kokkos_Parallel_Reduce.hpp b/core/src/Kokkos_Parallel_Reduce.hpp
index 39cb87acd..c656f72c2 100644
--- a/core/src/Kokkos_Parallel_Reduce.hpp
+++ b/core/src/Kokkos_Parallel_Reduce.hpp
@@ -1492,12 +1492,6 @@ struct ParallelReduceAdaptor {
     using PassedReducerType = typename return_value_adapter::reducer_type;
     uint64_t kpID           = 0;
 
-    /** Request a tuned policy from the tools subsystem */
-    auto response = Kokkos::Tools::Impl::begin_parallel_reduce<
-        typename return_value_adapter::reducer_type>(policy, functor, label,
-                                                     kpID);
-    auto& inner_policy = response.policy;
-
     using ReducerSelector =
         Kokkos::Impl::if_c<std::is_same<InvalidType, PassedReducerType>::value,
                            FunctorType, PassedReducerType>;
@@ -1509,7 +1503,12 @@ struct ParallelReduceAdaptor {
         functor, typename Analysis::Reducer(
                      ReducerSelector::select(functor, return_value)));
 
-    // FIXME Remove "Wrapper" once all backends implement the new interface
+  /** Request a tuned policy from the tools subsystem */
+    auto response = Kokkos::Tools::Impl::begin_parallel_reduce<
+        typename return_value_adapter::reducer_type>(policy, functor_reducer, label,
+                                                     kpID);
+    auto& inner_policy = response.policy;
+
     Impl::ParallelReduce<decltype(functor_reducer), PolicyType,
                          typename Impl::FunctorPolicyExecutionSpace<
                              FunctorType, PolicyType>::execution_space>

should fix it.

@Rombur
Member

Rombur commented Apr 11, 2024

@khuck Yes, I can reproduce the error with CUDA 12.4 and Clang 17.0.6. @masterleinad's patch fixes one issue, but there are others.

@khuck
Author

khuck commented Apr 11, 2024

@masterleinad your fix helps, but doesn't fix everything. @Rombur actually, it isn't just Clang + CUDA; I can reproduce it with GCC 9.4.0 when building the tests...

@khuck
Author

khuck commented Apr 11, 2024

This also helps some:

diff --git a/core/src/OpenMP/Kokkos_OpenMP_Parallel_Reduce.hpp b/core/src/OpenMP/Kokkos_OpenMP_Parallel_Reduce.hpp
index 05fd1c9dc..6a52b5ec2 100644
--- a/core/src/OpenMP/Kokkos_OpenMP_Parallel_Reduce.hpp
+++ b/core/src/OpenMP/Kokkos_OpenMP_Parallel_Reduce.hpp
@@ -34,6 +34,8 @@ class ParallelReduce<CombinedFunctorReducerType, Kokkos::RangePolicy<Traits...>,
   using Policy      = Kokkos::RangePolicy<Traits...>;
   using FunctorType = typename CombinedFunctorReducerType::functor_type;
   using ReducerType = typename CombinedFunctorReducerType::reducer_type;
+  using functor_type   = FunctorType;
+  using reducer_type   = ReducerType;
 
   using WorkTag = typename Policy::work_tag;
   using Member  = typename Policy::member_type;
@@ -191,6 +193,8 @@ class ParallelReduce<CombinedFunctorReducerType,
   using Policy        = typename MDRangePolicy::impl_range_policy;
   using FunctorType   = typename CombinedFunctorReducerType::functor_type;
   using ReducerType   = typename CombinedFunctorReducerType::reducer_type;
+  using functor_type   = FunctorType;
+  using reducer_type   = ReducerType;
 
   using WorkTag = typename MDRangePolicy::work_tag;
   using Member  = typename Policy::member_type;
@@ -343,6 +347,8 @@ class ParallelReduce<CombinedFunctorReducerType,
       Kokkos::Impl::TeamPolicyInternal<Kokkos::OpenMP, Properties...>;
   using FunctorType = typename CombinedFunctorReducerType::functor_type;
   using ReducerType = typename CombinedFunctorReducerType::reducer_type;
+  using functor_type   = FunctorType;
+  using reducer_type   = ReducerType;
 
   using WorkTag  = typename Policy::work_tag;
   using SchedTag = typename Policy::schedule_type::type;

@khuck
Author

khuck commented Apr 11, 2024

I think I found a clue... the remaining issues seem to be related to the fact that for MDRange reductions, a CombinedFunctorReducer isn't getting created somewhere, even though that's what is expected. Instead, Kokkos::Impl::ParallelReduce<> is getting passed a plain Functor...
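For reference, a simplified sketch of the wrapper type in play. The member names (functor_type, reducer_type, get_functor, get_reducer) match the patches quoted in this thread, but this is not the actual Kokkos definition, which lives in the implementation headers and carries more machinery.

// Simplified sketch only; not the real Kokkos::Impl::CombinedFunctorReducer.
template <class Functor, class Reducer>
class CombinedFunctorReducer {
 public:
  using functor_type = Functor;
  using reducer_type = Reducer;

  CombinedFunctorReducer(const Functor& f, const Reducer& r)
      : m_functor(f), m_reducer(r) {}

  // The tuning hooks in the patches above unpack the pair via these getters.
  const Functor& get_functor() const { return m_functor; }
  const Reducer& get_reducer() const { return m_reducer; }

 private:
  Functor m_functor;
  Reducer m_reducer;
};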

@khuck
Author

khuck commented Apr 11, 2024

OK, using @masterleinad's fix I was able to get past the problems with MDRange. Now I am having an issue with Teams and a simple reduce. When the CombinedFunctorReducer is constructed, the Reducer returned by this call https://github.com/khuck/kokkos/blob/e0dc0128e04f18c2bbbaefceef3616e7ddcfa3c4/core/src/Kokkos_Parallel_Reduce.hpp#L1506-L1508 doesn't have an operator() method:

[ 18%] Building CXX object core/unit_test/CMakeFiles/Kokkos_CoreUnitTest_Cuda1.dir/cuda/TestCuda_Other.cpp.o
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/impl/Kokkos_FunctorAnalysis.hpp(189): error: class "Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>" has no member "operator()"
          detected during:
            instantiation of class "Kokkos::Impl::FunctorAnalysis<PatternInterface, Policy, Functor, OverrideValueType>::deduce_value_type<F, Kokkos::Impl::FunctorAnalysis<PatternInterface, Policy, Functor, OverrideValueType>::REDUCE, void, true> [with PatternInterface=Kokkos::Impl::FunctorPatternInterface::REDUCE, Policy=Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda::execution_space, Kokkos::Cuda>, Functor=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>, OverrideValueType=void, F=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>]" 
(290): here
            instantiation of class "Kokkos::Impl::FunctorAnalysis<PatternInterface, Policy, Functor, OverrideValueType> [with PatternInterface=Kokkos::Impl::FunctorPatternInterface::REDUCE, Policy=Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda::execution_space, Kokkos::Cuda>, Functor=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>, OverrideValueType=void]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/Cuda/Kokkos_Cuda_Parallel_Team.hpp(120): here
            instantiation of "int Kokkos::Impl::TeamPolicyInternal<Kokkos::Cuda, Properties...>::team_size_max(const FunctorType &, const Kokkos::ParallelReduceTag &) const [with Properties=<Kokkos::Cuda>, FunctorType=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/impl/Kokkos_Profiling.hpp(330): here
            instantiation of "int Kokkos::Tools::Impl::SimpleTeamSizeCalculator::get_max_team_size(const Policy &, const Functor &, Tag) [with Policy=Kokkos::TeamPolicy<Kokkos::Cuda>, Functor=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>, Tag=Kokkos::ParallelReduceTag]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/Kokkos_Tuners.hpp(479): here
            instantiation of "Kokkos::Tools::Experimental::TeamSizeTuner::TeamSizeTuner(const std::string &, const Kokkos::TeamPolicy<Properties...> &, const Functor &, const TagType &, ViableConfigurationCalculator) [with ViableConfigurationCalculator=Kokkos::Tools::Impl::SimpleTeamSizeCalculator, Functor=Kokkos::Impl::CombinedFunctorReducer<TestCXX11::FunctorReduceTest<Kokkos::Cuda>, Kokkos::Impl::FunctorAnalysis<Kokkos::Impl::FunctorPatternInterface::REDUCE, Kokkos::TeamPolicy<Kokkos::Cuda>, TestCXX11::FunctorReduceTest<Kokkos::Cuda>, double>::Reducer, void>, TagType=Kokkos::ParallelReduceTag, Properties=<Kokkos::Cuda>]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/Kokkos_Parallel_Reduce.hpp(1510): here
            instantiation of "void Kokkos::Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute_impl(const std::string &, const PolicyType &, const FunctorType &, ReturnType &) [with PolicyType=Kokkos::TeamPolicy<Kokkos::Cuda>, FunctorType=TestCXX11::FunctorReduceTest<Kokkos::Cuda>, ReturnType=Kokkos::View<double, Kokkos::HostSpace, Kokkos::MemoryUnmanaged>]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/Kokkos_Parallel_Reduce.hpp(1535): here
            instantiation of "std::enable_if_t<<expression>, void> Kokkos::Impl::ParallelReduceAdaptor<PolicyType, FunctorType, ReturnType>::execute(const std::string &, const PolicyType &, const FunctorType &, ReturnType &) [with PolicyType=Kokkos::TeamPolicy<Kokkos::Cuda>, FunctorType=TestCXX11::FunctorReduceTest<Kokkos::Cuda>, ReturnType=Kokkos::View<double, Kokkos::HostSpace, Kokkos::MemoryUnmanaged>, Dummy=Kokkos::View<double, Kokkos::HostSpace, Kokkos::MemoryUnmanaged>]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/src/Kokkos_Parallel_Reduce.hpp(1772): here
            instantiation of "std::enable_if_t<<expression>, void> Kokkos::parallel_reduce(const PolicyType &, const FunctorType &, const ReturnType &) [with PolicyType=Kokkos::TeamPolicy<Kokkos::Cuda>, FunctorType=TestCXX11::FunctorReduceTest<Kokkos::Cuda>, ReturnType=Kokkos::View<double, Kokkos::HostSpace, Kokkos::MemoryUnmanaged>]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/unit_test/TestCXX11.hpp(220): here
            instantiation of "double TestCXX11::ReduceTestFunctor<DeviceType,PWRTest>() [with DeviceType=Kokkos::Cuda, PWRTest=false]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/unit_test/TestCXX11.hpp(304): here
            instantiation of "double TestCXX11::TestVariantFunctor<DeviceType>(int) [with DeviceType=Kokkos::Cuda]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/unit_test/TestCXX11.hpp(314): here
            instantiation of "__nv_bool TestCXX11::Test<DeviceType>(int) [with DeviceType=Kokkos::Cuda]" 
/home/users/khuck/src/occupancy/kokkos-khuck/core/unit_test/TestCXX11.hpp(347): here

@masterleinad
Contributor

diff --git a/core/src/impl/Kokkos_Profiling.hpp b/core/src/impl/Kokkos_Profiling.hpp
index 04e48454b..be06a5895 100644
--- a/core/src/impl/Kokkos_Profiling.hpp
+++ b/core/src/impl/Kokkos_Profiling.hpp
@@ -329,6 +329,12 @@ struct SimpleTeamSizeCalculator {
                         const Tag tag) {
     auto max = policy.team_size_max(functor, tag);
     return max;
+  }
+    template <typename Policy, typename FunctorReducer>
+  int get_max_team_size(const Policy& policy, const FunctorReducer& functor_reducer,
+                        const Kokkos::ParallelReduceTag tag) {
+    auto max = policy.team_size_max(functor_reducer.get_functor(), functor_reducer.get_reducer(), tag);
+    return max;
   }
   template <typename Policy, typename Functor, typename Tag>
   int get_recommended_team_size(const Policy& policy, const Functor& functor,
@@ -344,13 +350,13 @@ struct SimpleTeamSizeCalculator {
     using driver     = Kokkos::Impl::ParallelFor<Functor, Policy, exec_space>;
     return driver::max_tile_size_product(policy, functor);
   }
-  template <typename Policy, typename Functor>
+  template <typename Policy, typename FunctorReducer>
   int get_mdrange_max_tile_size_product(const Policy& policy,
-                                        const Functor& functor,
+                                        const FunctorReducer& functor_reducer,
                                         const Kokkos::ParallelReduceTag&) {
     using exec_space = typename Policy::execution_space;
-    using driver = Kokkos::Impl::ParallelReduce<Functor, Policy, exec_space>;
-    return driver::max_tile_size_product(policy, functor);
+    using driver = Kokkos::Impl::ParallelReduce<FunctorReducer, Policy, exec_space>;
+    return driver::max_tile_size_product(policy, functor_reducer.get_functor());
   }
 };
 
@@ -360,31 +366,24 @@ struct SimpleTeamSizeCalculator {
 // constructible from a reference to an
 // instance of their value_type so we construct
 // a value_type and temporary reducer here
-template <typename ReducerType>
 struct ComplexReducerSizeCalculator {
-  template <typename Policy, typename Functor, typename Tag>
-  int get_max_team_size(const Policy& policy, const Functor& functor,
+  template <typename Policy, typename FunctorReducer, typename Tag>
+  int get_max_team_size(const Policy& policy, const FunctorReducer& functor_reducer,
                         const Tag tag) {
-    using value_type = typename ReducerType::value_type;
-    value_type value;
-    ReducerType reducer_example = ReducerType(value);
-    return policy.team_size_max(functor, reducer_example, tag);
+    return policy.team_size_max(functor_reducer.get_functor(), functor_reducer.get_reducer(), tag);
   }
-  template <typename Policy, typename Functor, typename Tag>
-  int get_recommended_team_size(const Policy& policy, const Functor& functor,
+  template <typename Policy, typename FunctorReducer, typename Tag>
+  int get_recommended_team_size(const Policy& policy, const FunctorReducer& functor_reducer,
                                 const Tag tag) {
-    using value_type = typename ReducerType::value_type;
-    value_type value;
-    ReducerType reducer_example = ReducerType(value);
-    return policy.team_size_recommended(functor, reducer_example, tag);
+    return policy.team_size_recommended(functor_reducer.get_functor(), functor_reducer.get_reducer(), tag);
   }
-  template <typename Policy, typename Functor>
+  template <typename Policy, typename FunctorReducer>
   int get_mdrange_max_tile_size_product(const Policy& policy,
-                                        const Functor& functor,
+                                        const FunctorReducer& functor_reducer,
                                         const Kokkos::ParallelReduceTag&) {
     using exec_space = typename Policy::execution_space;
-    using driver = Kokkos::Impl::ParallelReduce<Functor, Policy, exec_space>;
-    return driver::max_tile_size_product(policy, functor);
+    using driver = Kokkos::Impl::ParallelReduce<FunctorReducer, Policy, exec_space>;
+    return driver::max_tile_size_product(policy, functor_reducer.get_functor());
   }
 };
 
@@ -442,7 +441,7 @@ auto generic_tune_policy(const std::string& label_in, Map& map,
         return (map.emplace(
                        label,
                        Tuner(label, policy, functor, tag,
-                             Impl::ComplexReducerSizeCalculator<ReducerType>{}))
+                             Impl::ComplexReducerSizeCalculator{}))
                     .first);
       }
       return my_tuner;

should do the trick.

@khuck
Author

khuck commented Apr 12, 2024

@masterleinad yup, that did the trick, thanks! I'm rerunning as many tests as I can think of before I push the change...

Contributor

@masterleinad masterleinad left a comment

Looks pretty good to me. Any chance that we can test this?

};
} // namespace Impl
template <class Bound>
class SingleDimensionalRangeTuner {
Contributor

What else do you envision this class to be used for?

Comment on lines 606 to 611
RangePolicyOccupancyTuner& operator=(const RangePolicyOccupancyTuner& other) =
default;
RangePolicyOccupancyTuner(const RangePolicyOccupancyTuner& other) = default;
RangePolicyOccupancyTuner& operator=(RangePolicyOccupancyTuner&& other) =
default;
RangePolicyOccupancyTuner(RangePolicyOccupancyTuner&& other) = default;
Contributor

Suggested change
RangePolicyOccupancyTuner& operator=(const RangePolicyOccupancyTuner& other) =
default;
RangePolicyOccupancyTuner(const RangePolicyOccupancyTuner& other) = default;
RangePolicyOccupancyTuner& operator=(RangePolicyOccupancyTuner&& other) =
default;
RangePolicyOccupancyTuner(RangePolicyOccupancyTuner&& other) = default;

Author

This change breaks compilation...

Contributor

Interesting. How is it failing?

Author

/usr/include/c++/8/tuple(1668): error: no instance of constructor "Kokkos::Tools::Experimental::RangePolicyOccupancyTuner::RangePolicyOccupancyTuner" matches the argument list
          detected during:
            instantiation of "std::pair<_T1, _T2>::pair(std::tuple<_Args1...> &, std::tuple<_Args2...> &, std::_Index_tuple<_Indexes1...>, std::_Index_tuple<_Indexes2...>) [with _T1=const std::__cxx11::string, _T2=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, _Args1=<const std::__cxx11::string &>, _Indexes1=<0UL>, _Args2=<>, _Indexes2=<>]" 
(1658): here
            instantiation of "std::pair<_T1, _T2>::pair(std::piecewise_construct_t, std::tuple<_Args1...>, std::tuple<_Args2...>) [with _T1=const std::__cxx11::string, _T2=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, _Args1=<const std::__cxx11::string &>, _Args2=<>]" 
/usr/include/c++/8/ext/new_allocator.h(136): here
            instantiation of "void __gnu_cxx::new_allocator<_Tp>::construct(_Up *, _Args &&...) [with _Tp=std::_Rb_tree_node<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Up=std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>, _Args=<const std::piecewise_construct_t &, std::tuple<const std::__cxx11::string &>, std::tuple<>>]" 
/usr/include/c++/8/bits/alloc_traits.h(475): here
            instantiation of "void std::allocator_traits<std::allocator<_Tp>>::construct(std::allocator_traits<std::allocator<_Tp>>::allocator_type &, _Up *, _Args &&...) [with _Tp=std::_Rb_tree_node<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Up=std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>, _Args=<const std::piecewise_construct_t &, std::tuple<const std::__cxx11::string &>, std::tuple<>>]" 
/usr/include/c++/8/bits/stl_tree.h(628): here
            instantiation of "void std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_M_construct_node(std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_Link_type, _Args &&...) [with _Key=std::__cxx11::string, _Val=std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>, _KeyOfValue=std::_Select1st<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Compare=std::less<std::__cxx11::string>, _Alloc=std::allocator<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Args=<const std::piecewise_construct_t &, std::tuple<const std::__cxx11::string &>, std::tuple<>>]" 
/usr/include/c++/8/bits/stl_tree.h(643): here
            [ 2 instantiation contexts not shown ]
            instantiation of "std::map<_Key, _Tp, _Compare, _Alloc>::mapped_type &std::map<_Key, _Tp, _Compare, _Alloc>::operator[](const std::map<_Key, _Tp, _Compare, _Alloc>::key_type &) [with _Key=std::__cxx11::string, _Tp=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, _Compare=std::less<std::__cxx11::string>, _Alloc=std::allocator<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>]" 
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_Profiling.hpp(638): here
            instantiation of "void Kokkos::Tools::Impl::generic_report_results<Tuner,Functor,TagType,TuningPermissionFunctor,Map,Policy>(const std::__cxx11::string &, Map &, const Policy &, const Functor &, const TagType &, const TuningPermissionFunctor &) [with Tuner=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, Functor=lambda [](ptrdiff_t)->void, TagType=Kokkos::ParallelForTag, TuningPermissionFunctor=lambda [](const Policy &)->__nv_bool, Map=std::map<std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, std::less<std::__cxx11::string>, std::allocator<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>>, Policy=Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>]" 
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_Profiling.hpp(688): here
            instantiation of "void Kokkos::Tools::Impl::report_policy_results(size_t, const std::__cxx11::string &, const Kokkos::RangePolicy<Properties...> &, const Functor &, const TagType &) [with Functor=lambda [](ptrdiff_t)->void, TagType=Kokkos::ParallelForTag, Properties=<Kokkos::DefaultHostExecutionSpace>]" 
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_Profiling.hpp(726): here
            instantiation of "void Kokkos::Tools::Impl::end_parallel_for(const ExecPolicy &, FunctorType &, const std::__cxx11::string &, uint64_t &) [with ExecPolicy=Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>, FunctorType=const lambda [](ptrdiff_t)->void]" 
/home/users/khuck/src/occupancy/kokkos/core/src/Kokkos_Parallel.hpp(148): here
            instantiation of "void Kokkos::parallel_for(const std::__cxx11::string &, const ExecPolicy &, const FunctorType &) [with ExecPolicy=Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>, FunctorType=lambda [](ptrdiff_t)->void, Enable=void]" 
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_HostSpace_deepcopy.cpp(81): here

1 error detected in the compilation of "/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_HostSpace_deepcopy.cpp".
gmake[2]: *** [kokkos/core/src/CMakeFiles/kokkoscore.dir/build.make:188: kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_HostSpace_deepcopy.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:1254: kokkos/core/src/CMakeFiles/kokkoscore.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2

Contributor

I can see similar errors when I remove the default constructor but not when removing the defaulted special member functions.

Author

Oops, you're right, I deleted one too many lines...

Comment on lines +366 to +395
// when we have a complex reducer, we need to pass an
// instance to team_size_recommended/max. Reducers
// aren't default constructible, but they are
// constructible from a reference to an
// instance of their value_type so we construct
// a value_type and temporary reducer here
struct ComplexReducerSizeCalculator {
template <typename Policy, typename FunctorReducer, typename Tag>
int get_max_team_size(const Policy& policy,
const FunctorReducer& functor_reducer, const Tag tag) {
return policy.team_size_max(functor_reducer.get_functor(),
functor_reducer.get_reducer(), tag);
}
template <typename Policy, typename FunctorReducer, typename Tag>
int get_recommended_team_size(const Policy& policy,
const FunctorReducer& functor_reducer,
const Tag tag) {
return policy.team_size_recommended(functor_reducer.get_functor(),
functor_reducer.get_reducer(), tag);
}
template <typename Policy, typename FunctorReducer>
int get_mdrange_max_tile_size_product(const Policy& policy,
const FunctorReducer& functor_reducer,
const Kokkos::ParallelReduceTag&) {
using exec_space = typename Policy::execution_space;
using driver =
Kokkos::Impl::ParallelReduce<FunctorReducer, Policy, exec_space>;
return driver::max_tile_size_product(policy, functor_reducer.get_functor());
}
};
Contributor

We should be able to just use SimpleTeamSizeCalculator now that ParallelReduce always takes a FunctorReducer argument.
We could do this cleanup in a follow-up pull request, of course.

core/src/impl/Kokkos_Profiling.hpp (outdated; resolved)
@khuck
Author

khuck commented Apr 15, 2024

> Looks pretty good to me. Any chance that we can test this?

I will see what I can do to make sure I test the MDRange, Team, and Range policies for the occupancy tuning... I know I have tested the Range policy with a big ArborX example, but it would be good to have some small tests for the Kokkos build, if possible. I am not sure how that would work without including APEX, though...

@khuck
Author

khuck commented Apr 19, 2024

@masterleinad @Rombur Here's a GitHub repo that includes six test programs and will build Kokkos and APEX to test the tuning. I know it's not a Kokkos unit test, but this will at least test everything together. It's possible this could be integrated into the kokkos-tools build, which also builds APEX: https://github.com/khuck/apex-kokkos-tuning
In particular, https://github.com/khuck/apex-kokkos-tuning/blob/aa7415e07d74fe76f4d5274c9c835ff517352aa9/tests/CMakeLists.txt#L20-L22 shows the apex_exec command line used to run the tuning. The next step after this one is to integrate the RangePolicy tuning from another branch of David's forked repository.

@masterleinad
Contributor

> @masterleinad @Rombur Here's a GitHub repo that includes six test programs and will build Kokkos and APEX to test the tuning. I know it's not a Kokkos unit test, but this will at least test everything together. It's possible this could be integrated into the kokkos-tools build, which also builds APEX: khuck/apex-kokkos-tuning

That might be a good solution.

@Rombur
Member

Rombur commented Apr 19, 2024

@khuck thanks for the tests. We can put them in Kokkos-Tools or in our nightly.
Looking at the tests, it seems they are just running but there are no checks. I understand that the tuning will vary with the hardware, but do you think we can check that the values are in a certain range? What worries me is that one day we break something and we only try 100% occupancy. With the current tests, we wouldn't realize that there is an issue, right?

@khuck
Author

khuck commented Apr 19, 2024

@Rombur good point, I should have added some validation. Yes, there will be some stochastic behavior in the results, but I can at least check for convergence in the output text or in the output convergence result file. Let me add that...

@khuck
Author

khuck commented Apr 22, 2024

> @Rombur good point, I should have added some validation. Yes, there will be some stochastic behavior in the results, but I can at least check for convergence in the output text or in the output convergence result file. Let me add that...

OK! Tests are updated and expanded. I now test all the benchmarks with all the different tuning options: no tuning, exhaustive, random, genetic_search, and simulated_annealing. Each test then uses the cached/converged results (if converged) to run. Then there is a fake test to clean up the converged results for the next test. I tested on a generic CUDA test machine with cuda/11.7 and on Perlmutter with the default modules (CUDA 12.2 and GCC 12.3). I did find one problem on Frontier, though: when I run the test test_mdrange_gemm_occupancy_no_tuning, I get a segmentation fault in HIP. Interestingly, it's only that test, and only when not using APEX or autotuning; just running build/tests/mdrange_gemm_occupancy causes the crash. I assume it has something to do with this line (requesting AUTO occupancy, which I did not think was implemented in HIP?...): https://github.com/khuck/apex-kokkos-tuning/blob/718f20e4c0f7e21db768c25caddd0c6ee0dd15ca/tests/mdrange_gemm_occupancy.cpp#L65-L66

@Rombur
Member

Rombur commented Apr 23, 2024

Hmm, I have no idea why it segfaults. Can you get a backtrace?

@khuck
Author

khuck commented Apr 23, 2024

> Hmm, I have no idea why it segfaults. Can you get a backtrace?

OK, it's fixed. The problem was with the benchmark; I guess I don't totally understand how some of this boilerplate that David Poliakoff wrote works. Apparently, if the size of the View is smaller than the number of loop iterations, indexes larger than the View extent are passed in to the parallel_for loop. The problem was in the test, not Kokkos.
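For illustration, a generic sketch of that failure mode (assumed shapes; this is not the actual tuning_playground code):

// The policy iterates N times, but the View only holds M < N elements,
// so indices in [M, N) index out of bounds.
const int M = 100;
const int N = 1000;
Kokkos::View<double*> v("v", M);
Kokkos::parallel_for(
    "oob_sketch", Kokkos::RangePolicy<>(0, N),
    KOKKOS_LAMBDA(const int i) {
      v(i) = 1.0;  // invalid access for i >= M
    });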

@masterleinad
Contributor

> Apparently, if the size of the View is smaller than the number of loop iterations, indexes larger than the View extent are passed in to the parallel_for loop.

That sounds weird. The tuning should not modify the number of work items, right?

@khuck
Author

khuck commented Apr 23, 2024

> That sounds weird. The tuning should not modify the number of work items, right?

Correct. The problem isn't with the tuning in Kokkos or in APEX, but with the tuning_playground.hpp file. I am still looking into it...

@khuck
Author

khuck commented Apr 23, 2024

> That sounds weird. The tuning should not modify the number of work items, right?

That's correct, and it does not. The problem is in the test program; I think I have it fixed...

@khuck
Author

khuck commented Apr 23, 2024

@Rombur @masterleinad OK! All tests are passing and working correctly. I tested on Frontier, Perlmutter, and my V100 system. I pushed the updates to the apex-kokkos-tuning repo; you should be able to build and run that, and if not, let me know.
By the way, the "idk_just_mm_no_tuning" test fails on Frontier due to a 180-second timeout (I did confirm that it will run to completion if you let it). The bad team-based gemm (https://github.com/khuck/apex-kokkos-tuning/blob/737da07df67840a07bb04cfa6923a3ac59cb376c/tests/idk_jmm.cpp#L64-L83) is really slow on MI250X. It's not fast on CUDA either, but at least it completes in 80-110 seconds (depending on V100 or A100).

Member

@Rombur Rombur left a comment

The code looks fine. Once this is merged, I'll add the tests to our nightly.

Kokkos::ParallelForTag{});
}
#else
(void)policy;
Member

You could use [[maybe_unused]] instead of having the #else branch.
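For illustration, a minimal sketch of the two idioms; the function names here are placeholders rather than the actual PR code:

// Idiom used in the PR: an #else branch "uses" the parameter via a cast.
template <typename Policy>
void report_results_v1(const Policy& policy) {
#ifdef KOKKOS_ENABLE_TUNING
  do_report(policy);  // hypothetical call
#else
  (void)policy;  // silence unused-parameter warnings
#endif
}

// Suggested alternative: [[maybe_unused]] removes the need for #else.
template <typename Policy>
void report_results_v2([[maybe_unused]] const Policy& policy) {
#ifdef KOKKOS_ENABLE_TUNING
  do_report(policy);  // hypothetical call
#endif
}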

@khuck
Author

khuck commented Apr 26, 2024

@Rombur @masterleinad unfortunately, I seem to have found a logic bug in how tuning contexts are incremented. There is a simple fix, but I need to do a bit more testing with it...

Minor fix to prevent incrementing the context id index when not calling `context_begin()`. Ideally, this should be refactored so that `begin_context()` increments the id and returns it, and `end_context()` is the only location that decrements the context id index.
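A minimal sketch of the suggested refactoring, assuming the context id is a plain counter (the real implementation in the Kokkos tools subsystem differs):

#include <cstddef>

// Hypothetical global counter; placeholder for the real tools state.
static std::size_t g_context_id = 0;

// begin_context() owns the increment and hands the id to the caller.
std::size_t begin_context() { return ++g_context_id; }

// end_context() is the only place the counter decreases.
void end_context(std::size_t /* id of the context being closed */) {
  --g_context_id;
}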
@khuck
Author

khuck commented Apr 26, 2024

> @Rombur @masterleinad unfortunately, I seem to have found a logic bug in how tuning contexts are incremented. There is a simple fix, but I need to do a bit more testing with it...

Simple fix committed!

@khuck
Author

khuck commented May 7, 2024

Is there anything else that needs to be done with this PR so it can be merged?

@Rombur
Member

Rombur commented May 7, 2024

@crtrott @dalg24 I think this is ready to be merged.
