Adding occupancy tuning for CUDA architectures #6788
base: develop
Conversation
@@ -450,7 +450,7 @@ class CudaInternal {
  cudaError_t cuda_func_set_attributes_wrapper(T* entry, cudaFuncAttribute attr,
                                               int value) const {
    if constexpr (setCudaDevice) set_cuda_device();
    return cudaFuncSetAttributes(entry, attr, value);
Fixed in #6786.
does that mean I should revert this fix in my PR?
Just rebase it on top of develop.
done
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Macros.hpp>
#include <Kokkos_Tuners.hpp>
Needs
#ifndef KOKKOS_IMPL_PUBLIC_INCLUDE
#define KOKKOS_IMPL_PUBLIC_INCLUDE
#define KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#endif
#include <Kokkos_Core_fwd.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Macros.hpp>
#include <Kokkos_Tuners.hpp>
and
#ifdef KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#undef KOKKOS_IMPL_PUBLIC_INCLUDE
#undef KOKKOS_IMPL_PUBLIC_INCLUDE_NOTDEFINED_PROFILING
#endif
at the end of the file since we are now including non-public headers.
done
Something is strange with this PR. 87 files have been changed, and my browser crashed when I tried to look at the difference. Can you try to rebase on develop once again?
Note: This is a re-commit of a branch that was somehow polluted when I rebased on develop. I started over with the 5 changed files.

The old Kokkos fork/branch from davidp (git@github.com:DavidPoliakoff/kokkos.git) was merged with the current Kokkos develop, and tested with ArborX to confirm that autotuning occupancy for the DBSCAN benchmark worked. In tests on a system with a V100, the original benchmark, when iterated 600 times, took 119.064 seconds to run. During the tuning process (using simulated annealing), the runtime was 108.014 seconds. When using cached results, the runtime was 109.058 seconds. The converged occupancy value was 70. Here are the cached results from APEX autotuning:

Input_1:
  name: kokkos.kernel_name
  id: 1
  info.type: string
  info.category: categorical
  info.valueQuantity: unbounded
  info.candidates: unbounded
  num_bins: 0
Input_2:
  name: kokkos.kernel_type
  id: 2
  info.type: string
  info.category: categorical
  info.valueQuantity: set
  info.candidates: [parallel_for,parallel_reduce,parallel_scan,parallel_copy]
Output_3:
  name: ArborX::Experimental::HalfTraversal
  id: 3
  info.type: int64
  info.category: ratio
  info.valueQuantity: range
  info.candidates: lower: 5 upper: 100 step: 5 open upper: 0 open lower: 0
Context_0:
  Name: "[2:parallel_for,1:ArborX::Experimental::HalfTraversal,tree_node:default]"
  Converged: true
  Results: NumVars: 1 id: 3 value: 70

In manual experiments, the ArborX team determined that the optimal occupancy for this example was between 40 and 90, which was a ~10% improvement over the baseline default of 100. See arborx/ArborX#815 for details.

One deviation from the branch that David had written: the occupancy range is [5-100] with a step size of 5. The original implementation in Kokkos used [1-100] with a step size of 1.
Force-pushed from 17d0af0 to 8dba118.
@Rombur @masterleinad I re-committed the branch on the current develop, there shouldn't be a corrupted commit history now. Let me know if you need any other changes. Thanks!
The two configurations that have TUNING enabled are failing. I'll try to look at the issue when I have time.
The two failures I see are related to something specific to linking Fortran on Fedora with Clang; I have no idea what that's about... I was able to reproduce it with a Docker image of my own, but I don't think it's related?
How did you reproduce? What images were you using?
@khuck I am talking about this and this
Oh, I was looking at the wrong failure. I was running the fedora:41 clang++ test, not the last one on the list...
It looks like the issue is with CUDA Clang. The failures are from
I forgot to turn tuning on... Now I can reproduce the issue.
@Rombur are you able to reproduce it with a newer clang and cuda? It has taken me a while to get clang 12.0.0 and cuda 11 together on the same machine, but I can now reproduce the build error...
Do you think it is something specific to
diff --git a/core/src/Kokkos_Parallel_Reduce.hpp b/core/src/Kokkos_Parallel_Reduce.hpp
index 39cb87acd..c656f72c2 100644
--- a/core/src/Kokkos_Parallel_Reduce.hpp
+++ b/core/src/Kokkos_Parallel_Reduce.hpp
@@ -1492,12 +1492,6 @@ struct ParallelReduceAdaptor {
using PassedReducerType = typename return_value_adapter::reducer_type;
uint64_t kpID = 0;
- /** Request a tuned policy from the tools subsystem */
- auto response = Kokkos::Tools::Impl::begin_parallel_reduce<
- typename return_value_adapter::reducer_type>(policy, functor, label,
- kpID);
- auto& inner_policy = response.policy;
-
using ReducerSelector =
Kokkos::Impl::if_c<std::is_same<InvalidType, PassedReducerType>::value,
FunctorType, PassedReducerType>;
@@ -1509,7 +1503,12 @@ struct ParallelReduceAdaptor {
functor, typename Analysis::Reducer(
ReducerSelector::select(functor, return_value)));
- // FIXME Remove "Wrapper" once all backends implement the new interface
+ /** Request a tuned policy from the tools subsystem */
+ auto response = Kokkos::Tools::Impl::begin_parallel_reduce<
+ typename return_value_adapter::reducer_type>(policy, functor_reducer, label,
+ kpID);
+ auto& inner_policy = response.policy;
+
Impl::ParallelReduce<decltype(functor_reducer), PolicyType,
typename Impl::FunctorPolicyExecutionSpace<
FunctorType, PolicyType>::execution_space>
should fix it.
@khuck Yes, I can reproduce the error with CUDA 12.4 and clang 17.0.6. @masterleinad's patch fixes one issue, but there are others.
@masterleinad your fix helps, but doesn't fix everything. @Rombur Actually, it isn't just clang + cuda; I can reproduce it with gcc 9.4.0 when building the tests...
this also helps some:
I think I found a clue... the remaining issues seem to be related to the fact that for MDRange reductions, there isn't a CombinedFunctorReducer getting created somewhere, but that's what is expected. Instead, Kokkos::Impl::ParallelReduce<> is getting passed a Functor...
…and OpenMP Reduce tuning. Now trying to fix Team tuning...
OK, using @masterleinad 's fix I was able to get past the problems with MDRange. Now I am having an issue with Teams and a simple reduce. When the CombinedFunctorReducer is constructed, the Reducer returned by this call https://github.com/khuck/kokkos/blob/e0dc0128e04f18c2bbbaefceef3616e7ddcfa3c4/core/src/Kokkos_Parallel_Reduce.hpp#L1506-L1508 doesn't have an
diff --git a/core/src/impl/Kokkos_Profiling.hpp b/core/src/impl/Kokkos_Profiling.hpp
index 04e48454b..be06a5895 100644
--- a/core/src/impl/Kokkos_Profiling.hpp
+++ b/core/src/impl/Kokkos_Profiling.hpp
@@ -329,6 +329,12 @@ struct SimpleTeamSizeCalculator {
const Tag tag) {
auto max = policy.team_size_max(functor, tag);
return max;
+ }
+ template <typename Policy, typename FunctorReducer>
+ int get_max_team_size(const Policy& policy, const FunctorReducer& functor_reducer,
+ const Kokkos::ParallelReduceTag tag) {
+ auto max = policy.team_size_max(functor_reducer.get_functor(), functor_reducer.get_reducer(), tag);
+ return max;
}
template <typename Policy, typename Functor, typename Tag>
int get_recommended_team_size(const Policy& policy, const Functor& functor,
@@ -344,13 +350,13 @@ struct SimpleTeamSizeCalculator {
using driver = Kokkos::Impl::ParallelFor<Functor, Policy, exec_space>;
return driver::max_tile_size_product(policy, functor);
}
- template <typename Policy, typename Functor>
+ template <typename Policy, typename FunctorReducer>
int get_mdrange_max_tile_size_product(const Policy& policy,
- const Functor& functor,
+ const FunctorReducer& functor_reducer,
const Kokkos::ParallelReduceTag&) {
using exec_space = typename Policy::execution_space;
- using driver = Kokkos::Impl::ParallelReduce<Functor, Policy, exec_space>;
- return driver::max_tile_size_product(policy, functor);
+ using driver = Kokkos::Impl::ParallelReduce<FunctorReducer, Policy, exec_space>;
+ return driver::max_tile_size_product(policy, functor_reducer.get_functor());
}
};
@@ -360,31 +366,24 @@ struct SimpleTeamSizeCalculator {
// constructible from a reference to an
// instance of their value_type so we construct
// a value_type and temporary reducer here
-template <typename ReducerType>
struct ComplexReducerSizeCalculator {
- template <typename Policy, typename Functor, typename Tag>
- int get_max_team_size(const Policy& policy, const Functor& functor,
+ template <typename Policy, typename FunctorReducer, typename Tag>
+ int get_max_team_size(const Policy& policy, const FunctorReducer& functor_reducer,
const Tag tag) {
- using value_type = typename ReducerType::value_type;
- value_type value;
- ReducerType reducer_example = ReducerType(value);
- return policy.team_size_max(functor, reducer_example, tag);
+ return policy.team_size_max(functor_reducer.get_functor(), functor_reducer.get_reducer(), tag);
}
- template <typename Policy, typename Functor, typename Tag>
- int get_recommended_team_size(const Policy& policy, const Functor& functor,
+ template <typename Policy, typename FunctorReducer, typename Tag>
+ int get_recommended_team_size(const Policy& policy, const FunctorReducer& functor_reducer,
const Tag tag) {
- using value_type = typename ReducerType::value_type;
- value_type value;
- ReducerType reducer_example = ReducerType(value);
- return policy.team_size_recommended(functor, reducer_example, tag);
+ return policy.team_size_recommended(functor_reducer.get_functor(), functor_reducer.get_reducer(), tag);
}
- template <typename Policy, typename Functor>
+ template <typename Policy, typename FunctorReducer>
int get_mdrange_max_tile_size_product(const Policy& policy,
- const Functor& functor,
+ const FunctorReducer& functor_reducer,
const Kokkos::ParallelReduceTag&) {
using exec_space = typename Policy::execution_space;
- using driver = Kokkos::Impl::ParallelReduce<Functor, Policy, exec_space>;
- return driver::max_tile_size_product(policy, functor);
+ using driver = Kokkos::Impl::ParallelReduce<FunctorReducer, Policy, exec_space>;
+ return driver::max_tile_size_product(policy, functor_reducer.get_functor());
}
};
@@ -442,7 +441,7 @@ auto generic_tune_policy(const std::string& label_in, Map& map,
return (map.emplace(
label,
Tuner(label, policy, functor, tag,
- Impl::ComplexReducerSizeCalculator<ReducerType>{}))
+ Impl::ComplexReducerSizeCalculator{}))
.first);
}
  return my_tuner;
should do the trick.
@masterleinad yup, that did the trick, thanks! I'm rerunning as many tests as I can think of before I push the change...
Looks pretty good to me. Any chance that we can test this?
};
}  // namespace Impl
template <class Bound>
class SingleDimensionalRangeTuner {
What else do you envision this class to be used for?
core/src/Kokkos_Tuners.hpp (outdated)
RangePolicyOccupancyTuner& operator=(const RangePolicyOccupancyTuner& other) =
    default;
RangePolicyOccupancyTuner(const RangePolicyOccupancyTuner& other) = default;
RangePolicyOccupancyTuner& operator=(RangePolicyOccupancyTuner&& other) =
    default;
RangePolicyOccupancyTuner(RangePolicyOccupancyTuner&& other) = default;
This change breaks compilation...
Interesting. How is it failing?
/usr/include/c++/8/tuple(1668): error: no instance of constructor "Kokkos::Tools::Experimental::RangePolicyOccupancyTuner::RangePolicyOccupancyTuner" matches the argument list
detected during:
instantiation of "std::pair<_T1, _T2>::pair(std::tuple<_Args1...> &, std::tuple<_Args2...> &, std::_Index_tuple<_Indexes1...>, std::_Index_tuple<_Indexes2...>) [with _T1=const std::__cxx11::string, _T2=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, _Args1=<const std::__cxx11::string &>, _Indexes1=<0UL>, _Args2=<>, _Indexes2=<>]"
(1658): here
instantiation of "std::pair<_T1, _T2>::pair(std::piecewise_construct_t, std::tuple<_Args1...>, std::tuple<_Args2...>) [with _T1=const std::__cxx11::string, _T2=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, _Args1=<const std::__cxx11::string &>, _Args2=<>]"
/usr/include/c++/8/ext/new_allocator.h(136): here
instantiation of "void __gnu_cxx::new_allocator<_Tp>::construct(_Up *, _Args &&...) [with _Tp=std::_Rb_tree_node<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Up=std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>, _Args=<const std::piecewise_construct_t &, std::tuple<const std::__cxx11::string &>, std::tuple<>>]"
/usr/include/c++/8/bits/alloc_traits.h(475): here
instantiation of "void std::allocator_traits<std::allocator<_Tp>>::construct(std::allocator_traits<std::allocator<_Tp>>::allocator_type &, _Up *, _Args &&...) [with _Tp=std::_Rb_tree_node<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Up=std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>, _Args=<const std::piecewise_construct_t &, std::tuple<const std::__cxx11::string &>, std::tuple<>>]"
/usr/include/c++/8/bits/stl_tree.h(628): here
instantiation of "void std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_M_construct_node(std::_Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::_Link_type, _Args &&...) [with _Key=std::__cxx11::string, _Val=std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>, _KeyOfValue=std::_Select1st<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Compare=std::less<std::__cxx11::string>, _Alloc=std::allocator<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>, _Args=<const std::piecewise_construct_t &, std::tuple<const std::__cxx11::string &>, std::tuple<>>]"
/usr/include/c++/8/bits/stl_tree.h(643): here
[ 2 instantiation contexts not shown ]
instantiation of "std::map<_Key, _Tp, _Compare, _Alloc>::mapped_type &std::map<_Key, _Tp, _Compare, _Alloc>::operator[](const std::map<_Key, _Tp, _Compare, _Alloc>::key_type &) [with _Key=std::__cxx11::string, _Tp=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, _Compare=std::less<std::__cxx11::string>, _Alloc=std::allocator<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>]"
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_Profiling.hpp(638): here
instantiation of "void Kokkos::Tools::Impl::generic_report_results<Tuner,Functor,TagType,TuningPermissionFunctor,Map,Policy>(const std::__cxx11::string &, Map &, const Policy &, const Functor &, const TagType &, const TuningPermissionFunctor &) [with Tuner=Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, Functor=lambda [](ptrdiff_t)->void, TagType=Kokkos::ParallelForTag, TuningPermissionFunctor=lambda [](const Policy &)->__nv_bool, Map=std::map<std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner, std::less<std::__cxx11::string>, std::allocator<std::pair<const std::__cxx11::string, Kokkos::Tools::Experimental::RangePolicyOccupancyTuner>>>, Policy=Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>]"
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_Profiling.hpp(688): here
instantiation of "void Kokkos::Tools::Impl::report_policy_results(size_t, const std::__cxx11::string &, const Kokkos::RangePolicy<Properties...> &, const Functor &, const TagType &) [with Functor=lambda [](ptrdiff_t)->void, TagType=Kokkos::ParallelForTag, Properties=<Kokkos::DefaultHostExecutionSpace>]"
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_Profiling.hpp(726): here
instantiation of "void Kokkos::Tools::Impl::end_parallel_for(const ExecPolicy &, FunctorType &, const std::__cxx11::string &, uint64_t &) [with ExecPolicy=Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>, FunctorType=const lambda [](ptrdiff_t)->void]"
/home/users/khuck/src/occupancy/kokkos/core/src/Kokkos_Parallel.hpp(148): here
instantiation of "void Kokkos::parallel_for(const std::__cxx11::string &, const ExecPolicy &, const FunctorType &) [with ExecPolicy=Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>, FunctorType=lambda [](ptrdiff_t)->void, Enable=void]"
/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_HostSpace_deepcopy.cpp(81): here
1 error detected in the compilation of "/home/users/khuck/src/occupancy/kokkos/core/src/impl/Kokkos_HostSpace_deepcopy.cpp".
gmake[2]: *** [kokkos/core/src/CMakeFiles/kokkoscore.dir/build.make:188: kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_HostSpace_deepcopy.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:1254: kokkos/core/src/CMakeFiles/kokkoscore.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2
I can see similar errors when I remove the default constructor but not when removing the defaulted special member functions.
oops, you're right, I deleted one too many lines...
// when we have a complex reducer, we need to pass an
// instance to team_size_recommended/max. Reducers
// aren't default constructible, but they are
// constructible from a reference to an
// instance of their value_type so we construct
// a value_type and temporary reducer here
struct ComplexReducerSizeCalculator {
  template <typename Policy, typename FunctorReducer, typename Tag>
  int get_max_team_size(const Policy& policy,
                        const FunctorReducer& functor_reducer, const Tag tag) {
    return policy.team_size_max(functor_reducer.get_functor(),
                                functor_reducer.get_reducer(), tag);
  }
  template <typename Policy, typename FunctorReducer, typename Tag>
  int get_recommended_team_size(const Policy& policy,
                                const FunctorReducer& functor_reducer,
                                const Tag tag) {
    return policy.team_size_recommended(functor_reducer.get_functor(),
                                        functor_reducer.get_reducer(), tag);
  }
  template <typename Policy, typename FunctorReducer>
  int get_mdrange_max_tile_size_product(const Policy& policy,
                                        const FunctorReducer& functor_reducer,
                                        const Kokkos::ParallelReduceTag&) {
    using exec_space = typename Policy::execution_space;
    using driver =
        Kokkos::Impl::ParallelReduce<FunctorReducer, Policy, exec_space>;
    return driver::max_tile_size_product(policy, functor_reducer.get_functor());
  }
};
We should be able to just use SimpleTeamSizeCalculator now that ParallelReduce always takes a FunctorReducer argument. We could do this cleanup in a follow-up pull request, of course.
I will see what I can do to make sure I test the MDRange, Team, and Range policies for the occupancy tuning... I know I have tested the Range policy with a big ArborX example, but it would be good to have some small tests for the Kokkos build, if possible. I am not sure how that would work without including APEX, though...
Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
@masterleinad @Rombur Here's a GitHub repo that includes six test programs and will build Kokkos and APEX to test the tuning. I know it's not a Kokkos unit test, but this at least will test everything all together. It's possible this could be integrated into the kokkos-tools build, which also builds APEX. https://github.com/khuck/apex-kokkos-tuning
That might be a good solution.
@khuck thanks for the tests. We can put them in Kokkos-Tools or in our nightly.
@Rombur good point, I should have added some validation. Yes, there will be some stochastic behavior in the results, but I can at least check for convergence in the output text or in the output convergence result file. Let me add that...
OK! Tests are updated and expanded. I now test all the benchmarks with all the different tuning options: no tuning, exhaustive, random, genetic_search, and simulated_annealing. Each test then uses the cached/converged results (if converged) to run. Then there is a fake test to clean up the converged results for the next test. I tested on a generic CUDA test machine with cuda/11.7 and on Perlmutter with default modules (cuda 12.2 and gcc 12.3). I did find one problem on Frontier, though - when I run the test
hmm, I have no idea why it segfaults. Can you get a backtrace?
OK, it's fixed. The problem was with the benchmark; I guess I don't totally understand how some of this boilerplate that David Poliakoff wrote works. Apparently, if the size of the View is smaller than the number of loop iterations, indices larger than the View's extent are passed to the parallel_for loop. The problem was in the test, not Kokkos.
That sounds weird. The tuning should not modify the number of work items, right?
correct. the problem isn't with the tuning in kokkos or in apex, but the tuning_playground.hpp file. I am still looking into it...
That's correct, and it does not. The problem is the test program; I think I have it fixed...
@Rombur @masterleinad OK! All tests are passing and working correctly. I tested on Frontier, Perlmutter, and my V100 system. I pushed the updates to the apex-kokkos-tuning repo; you should be able to build & run that - if not, let me know.
The code looks fine. Once this is merged I'll add the tests in our nightly.
      Kokkos::ParallelForTag{});
}
#else
(void)policy;
You could use [[maybe_unused]] instead of having the #else branch.
@Rombur @masterleinad unfortunately, I seem to have found a logic bug in how tuning contexts are incremented. There is a simple fix, but I need to do a bit more testing with it...
Minor fix to prevent incrementing the context id index when not calling `begin_context()`. In actuality, this should be refactored so that `begin_context()` increments the id and returns it. `end_context()` is the only location that decrements the context id index.
Simple fix committed!
Is there anything else that needs to be done with this PR so it can be merged?
The old Kokkos fork/branch from:
https://github.com/DavidPoliakoff/kokkos/tree/feature/tune-occupancy
was merged with current Kokkos develop, and tested with ArborX to
confirm that autotuning occupancy for the DBSCAN benchmark worked.
In tests on a system with V100, the original benchmark when iterated
600 times took 119.064 seconds to run. During the tuning process
(using simulated annealing), the runtime was 108.014 seconds.
When using cached results, the runtime was 109.058 seconds. The
converged occupancy value was 70. Here are the cached results
from APEX autotuning:
In manual experiments, the ArborX team determined that the optimal
occupancy for this example was between 40 and 90, which was a ~10%
improvement over the baseline default of 100. See arborx/ArborX#815
for details.
One deviation from the branch that David had written - I set the occupancy
range to [5-100], with a step size of 5. The original implementation
in Kokkos used [1-100] with a step size of 1. The change was to reduce the search space
and allow faster convergence.
For anything that works, all credit goes to @DavidPoliakoff