Experimental feature: control cuda occupancy #3379
Conversation
I would consider moving the occupancy limitation all the way down to the KernelLaunch and leaving the ParallelFoo functions etc. alone. That would orthogonalize the block size / team_size choice from the occupancy limitation, and it would work for all policies at once instead of doing something different for every policy. Essentially, just add an additional shared memory request at kernel launch. The current approach has the additional problem that the KernelLaunch now thinks you need shared memory and thus will prefer shared memory over L1, which would be exactly the wrong thing to do for RangePolicy.
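The suggested mechanism can be sketched in a few lines of plain C++ (the function name and parameters are illustrative, not actual Kokkos names): by requesting enough dynamic shared memory per block, at most the desired number of blocks can be resident per multiprocessor.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: limit occupancy at kernel launch by requesting extra
// dynamic shared memory. Requesting ceil(shmem_per_sm / active_blocks) bytes
// per block guarantees that at most active_blocks blocks fit on one SM.
// (A real implementation would also account for the shared memory
// allocation granularity of the device.)
inline std::size_t shmem_request_to_limit_occupancy(std::size_t shmem_per_sm,
                                                    std::size_t active_blocks) {
  return (shmem_per_sm + active_blocks - 1) / active_blocks;
}
```

For example, with 64 KiB of shared memory per SM, requesting 32 KiB per block caps residency at two blocks per SM.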
I think we need a different implementation strategy (see my other comment).
Force-pushed 3df9ae1 to 2f7f17b
Any thoughts on testing running parallel_[for/reduce/scan] with these policies, rather than just construction?
Force-pushed 9c8a531 to 223877f
Is there a way to use one of the cudaOccupancy functions to check that our launch parameters make sense?
@@ -198,6 +198,31 @@ int cuda_get_opt_block_size(const CudaInternal* cuda_instance,
                            LaunchBounds{});
}

// Assuming cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1)
inline size_t get_shmem_per_sm_prefer_l1(cudaDeviceProp const& properties) {
There is no way to compute this?
Not that I know of
Actually, there is a cudaOccSMemPerMultiprocessor() function in the "cuda_occupancy.h" header, but technically it belongs to implementation details and we are not really supposed to use it.
Turns out the function does not do what we want for Volta+: it does not return the smallest possible configuration, which is zero...
Force-pushed 693c777 to afab561
No, https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OCCUPANCY.html
Why did you call the analyzed policy property "controls block size deduction"? I mean, you are making a runtime property value called "desired occupancy" depend on that thing. Why not name it that as well?
Otherwise, I am good with the technical implementation.
Because I was thinking about extensibility.
Force-pushed 166c8ad to e38bec1
Force-pushed cc8eb5b to 451b705
std::enable_if_t<Enable> experimental_set_desired_occupancy(
    Experimental::DesiredOccupancy desired_occupancy) {
  this->m_occupancy = {desired_occupancy};
auto experimental_get_desired_occupancy() const {
I actually meant to SFINAE away this setter and the getter below when experimental_contains_desired_occupancy is false.
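This suggestion can be illustrated with a minimal sketch (the class is hypothetical; names mirror the snippet above): both the setter and the getter are gated on a compile-time flag, so neither member exists when the policy does not carry a desired occupancy.

```cpp
#include <cassert>
#include <type_traits>

struct DesiredOccupancy {
  int value;
};

// Hypothetical policy sketch: HasDesiredOccupancy stands in for the
// experimental_contains_desired_occupancy trait. The setter and getter are
// SFINAE'd away when the flag is false, so calling them on such a policy is
// a compile-time error rather than a silent no-op.
template <bool HasDesiredOccupancy>
struct PolicySketch {
  DesiredOccupancy m_occupancy{100};

  template <bool Enable = HasDesiredOccupancy>
  std::enable_if_t<Enable> experimental_set_desired_occupancy(
      DesiredOccupancy desired_occupancy) {
    m_occupancy = desired_occupancy;
  }

  template <bool Enable = HasDesiredOccupancy>
  std::enable_if_t<Enable, DesiredOccupancy>
  experimental_get_desired_occupancy() const {
    return m_occupancy;
  }
};
```

With `PolicySketch<false>`, any attempt to call either member fails template substitution, which is the behavior requested in the review.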
  (void)cache_config_preference_cached;
  if (cache_config_preference_cached != prefer_shmem) {
    CUDA_SAFE_CALL(cudaFuncSetCacheConfig(
        func,
        (prefer_shmem ? cudaFuncCachePreferShared : cudaFuncCachePreferL1)));
    cache_config_preference_cached = prefer_shmem;
  }
This is being addressed in #3560.
Force-pushed 49b12fb to d0224b2
Retest this please
Force-pushed d9e97a3 to f64b884
Added PolicyTraits::occupancy_control type member
…tion of Policy::occupancy_control
Specialize PolicyPropertyAdaptor for DesiredOccupancy and MaximizeOccupancy. Add overloads Experimental::prefer(Policy, OccupancyControl).
Force-pushed f64b884 to 7325df4
Minor comment on the C-style cast notwithstanding, this looks fine (although PolicyTraitsBase having nine template parameters is a bit intimidating, especially the conditional_t block, but that is the pattern already in use here).
The relevant MSVC error message is
LGTM once it passes CI
Force-pushed 4fdd237 to f0f5261
We need to fix the "divide by zero" issue, and we should probably round to nearest, not to the next lower integer.
size_t const shmem_per_sm_prefer_l1 = get_shmem_per_sm_prefer_l1(properties);
size_t const static_shmem = attributes.sharedSizeBytes;
int active_blocks = properties.maxThreadsPerMultiProcessor / block_size *
                    desired_occupancy / 100;
I think this is iffy. Say maxThreadsPerMultiProcessor is 2048, the block size is 700, and desired_occupancy is 33: doesn't that give zero here (2048 / 700 = 2.9... = 2, 2 * 33 = 66, 66 / 100 = 0), even though 700 / 2048 > 0.33?
In either case, don't we need to turn active_blocks == 0 into active_blocks = 1 before dividing in the next line?
And even if you use 670 as the block size (which puts you just below 0.33), you still end up with zero, since 3 * 33 = 99 and that still divides away to zero. This problem can be solved by doing the computation in double. We also should round to nearest rather than toward zero: if the computation says 1.9 blocks give the desired occupancy, I think we should round to 2 instead of 1.
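The arithmetic issue can be reproduced in a few lines of plain C++ (a sketch, not the actual Kokkos code): the integer version truncates to zero for the cases above, while computing in double, rounding to nearest, and clamping to at least one block behaves as suggested.

```cpp
#include <algorithm>
#include <cmath>

// Integer version, mirroring the snippet under discussion: truncates at
// every step, so 2048 / 700 = 2, then 2 * 33 = 66, then 66 / 100 = 0
// active blocks, which then divides by zero downstream.
inline int active_blocks_int(int max_threads_per_sm, int block_size,
                             int desired_occupancy) {
  return max_threads_per_sm / block_size * desired_occupancy / 100;
}

// Suggested fix (sketch): compute in double, round to nearest, and clamp to
// at least one block so the subsequent division cannot be by zero.
inline int active_blocks_double(int max_threads_per_sm, int block_size,
                                int desired_occupancy) {
  double const blocks = static_cast<double>(max_threads_per_sm) / block_size *
                        desired_occupancy / 100.0;
  return std::max(1, static_cast<int>(std::lround(blocks)));
}
```

Note that the double version also rounds 1.9 computed blocks up to 2, rather than truncating to 1.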
In either case the potential divide by zero needs to be fixed.
Fixed
typename policy_in_t::traits::iteration_pattern,
typename policy_in_t::traits::launch_bounds,
typename policy_in_t::traits::work_item_property,
MaximizeOccupancy>;
Shouldn't there be a graph tag in here?
Ah, we don't need that, since graph tags are not public and so shouldn't exist by the time prefer/require is called, right?
I should add: other than the divide-by-zero issue and the rounding, I am good with this.
…when computing the number of active blocks in modify_launch_configuration_if_desired_occupancy_is_specified
Co-Authored-By: Christian Trott <crtrott@sandia.gov>
Force-pushed f0f5261 to 3824a11
Retest this please
Can confirm that 3824a11 works as expected on ArborX's HACC halo finding problem (see #3379 (comment)).
Looks good. Can we just restart that one test in Jenkins? It was a timing issue.
Build passed https://cloud.cees.ornl.gov/jenkins-ci/blue/organizations/jenkins/Kokkos/detail/Kokkos/3487/pipeline/
The intent is that passing `prefer(policy, DesiredOccupancy(33))` to a `parallel_for()`, `parallel_reduce()`, or `parallel_scan()` will bypass the block size deduction that tries to maximize occupancy and instead adjust the launch parameters (by fixing the block size and requesting shared memory) to achieve the specified occupancy. The desired occupancy is in percent.

I have only implemented it for `ParallelFor<RangePolicy>` and I am looking for feedback. I am not sure how to test the feature. I have tried it in ArborX on the main tree traversal kernel for spatial predicates.

Update: Instead of "bypassing" the normal block size deduction, we decided to modify the launch configuration just before the kernel launch.