
Experimental feature: control cuda occupancy #3379

Merged: 12 commits merged into kokkos:develop on Nov 10, 2020

Conversation

@dalg24 (Member) commented Sep 11, 2020

The intent is that passing prefer(policy, DesiredOccupancy(33)) to a parallel_for(), parallel_reduce(), or parallel_scan() bypasses the block size deduction that tries to maximize occupancy, and instead adjusts the launch parameters (by fixing the block size and requesting shared memory) to achieve the specified occupancy. The desired occupancy is given in percent.
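For illustration, here is a minimal standalone sketch of the intended usage (the kernel is hypothetical; prefer and DesiredOccupancy are the names this PR introduces):

#include <Kokkos_Core.hpp>

// Hypothetical kernel: scale a view, asking for roughly 33% occupancy.
void scale_by_two(Kokkos::View<double*> v) {
  auto const policy = Kokkos::Experimental::prefer(
      Kokkos::RangePolicy<>(0, v.extent(0)),
      Kokkos::Experimental::DesiredOccupancy{33});  // in percent
  Kokkos::parallel_for(
      "scale_by_two", policy, KOKKOS_LAMBDA(int i) { v(i) *= 2.0; });
}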

I have only implemented it for ParallelFor<RangePolicy> and I am looking for feedback.

I am not sure how to test the feature. I have tried it in ArborX on the main tree-traversal kernel for spatial predicates.

diff --git a/src/details/ArborX_DetailsTreeTraversal.hpp b/src/details/ArborX_DetailsTreeTraversal.hpp
index 6c7fc1f..8b07297 100644
--- a/src/details/ArborX_DetailsTreeTraversal.hpp
+++ b/src/details/ArborX_DetailsTreeTraversal.hpp
@@ -62,8 +62,8 @@ struct TreeTraversal<BVH, Predicates, Callback, SpatialPredicateTag>
     else
     {
       Kokkos::parallel_for(ARBORX_MARK_REGION("BVH:spatial_queries"),
-                           Kokkos::RangePolicy<ExecutionSpace>(
-                               space, 0, Access::size(predicates)),
+                           Kokkos::Experimental::prefer(Kokkos::RangePolicy<ExecutionSpace>(
+                               space, 0, Access::size(predicates)), Kokkos::Experimental::DesiredOccupancy{25}),
                            *this);
     }
   }

Update: instead of "bypassing" the normal block size deduction, we decided to modify the launch configuration just before the kernel launch.

@dalg24 added the [WIP] label on Sep 11, 2020
@crtrott (Member) commented Sep 11, 2020

I would consider moving the occupancy limitation all the way down to the KernelLaunch and leaving the ParallelFoo functions etc. alone. That would orthogonalize the block size / team_size choice from the occupancy limitation, and it would work for all policies at once instead of doing something different for every policy: essentially, just request additional shared memory at kernel launch. The current approach has the additional problem that the KernelLaunch now thinks you need shared memory and thus will prefer shared memory over L1, which would be exactly the wrong thing to do for RangePolicy.
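To sketch the suggestion (a hypothetical helper, not the PR's actual code): compute how much dynamic shared memory per block would leave room for only the desired number of resident blocks per SM, and request that amount at launch. The integer-division pitfalls of exactly this kind of formula come up later in the review.

#include <cuda_runtime.h>

// Hypothetical sketch: a per-block shared-memory request that caps the
// number of resident blocks per SM near `desired` percent occupancy.
inline size_t shmem_for_occupancy(cudaDeviceProp const& prop, int block_size,
                                  int desired /* percent */) {
  int active_blocks =
      prop.maxThreadsPerMultiProcessor / block_size * desired / 100;
  if (active_blocks < 1) active_blocks = 1;  // at least one block per SM
  // With this much shared memory per block, at most `active_blocks` blocks
  // can be resident on each SM.
  return prop.sharedMemPerMultiprocessor / active_blocks;
}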

@crtrott (Member) left a comment

I think we need a different implementation strategy (see my other comment).

@DavidPoliakoff (Contributor) left a comment

Any thoughts on testing actually running parallel_[for/reduce/scan] with these policies, rather than just constructing them?

@Rombur (Member) left a comment

Is there a way to use one of the cudaOccupancy functions to check that our launch parameters make sense?

@@ -198,6 +198,31 @@ int cuda_get_opt_block_size(const CudaInternal* cuda_instance,
LaunchBounds{});
}

// Assuming cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1)
inline size_t get_shmem_per_sm_prefer_l1(cudaDeviceProp const& properties) {
Member: There is no way to compute this?

@dalg24 (Member Author): Not that I know of.

@dalg24 (Member Author), Nov 3, 2020: Actually, there is a cudaOccSMemPerMultiprocessor() function in the "cuda_occupancy.h" header, but technically it belongs to implementation details and we are not really supposed to use it.

@dalg24 (Member Author): Turns out the function does not do what we want for Volta+. It does not return the smallest possible configuration, which is zero...

@dalg24 (Member Author) commented Oct 29, 2020

Is there a way to use one of the cudaOccupancy functions to check that our launch parameters make sense?

No

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OCCUPANCY.html

@crtrott (Member) left a comment

Why did you call the analyze-policy property "controls block size deduction"? You are making a runtime property value called "desired occupancy" depend on that thing. Why not name it that as well?

@crtrott (Member) commented Oct 30, 2020

Oh, otherwise I am good with the technical implementation.

@dalg24 (Member Author) commented Oct 30, 2020

Why did you call the analyze-policy property "controls block size deduction"? You are making a runtime property value called "desired occupancy" depend on that thing. Why not name it that as well?

Because I was thinking about extensibility.

@aprokop (Collaborator) commented Oct 30, 2020

I just tested this PR with ArborX's HACC problem (36M) on a V100 and was able to reproduce results from early October obtained with @dalg24's original branch in his fork. For this use case, it results in a 25% faster runtime for the query portion of the algorithm when used with 50% or 75% occupancy.

std::enable_if_t<Enable> experimental_set_desired_occupancy(
    Experimental::DesiredOccupancy desired_occupancy) {
  this->m_occupancy = {desired_occupancy};
}
auto experimental_get_desired_occupancy() const {
@dalg24 (Member Author): I actually meant to SFINAE away this setter and the getter below when experimental_contains_desired_occupancy is false.

Comment on lines 167 to 175
(void)cache_config_preference_cached;
if (cache_config_preference_cached != prefer_shmem) {
CUDA_SAFE_CALL(cudaFuncSetCacheConfig(
func,
(prefer_shmem ? cudaFuncCachePreferShared : cudaFuncCachePreferL1)));
cache_config_preference_cached = prefer_shmem;
}
@dalg24 (Member Author): This is being addressed in #3560.

@dalg24 (Member Author) commented Nov 4, 2020

Retest this please

Added PolicyTraits::occupancy_control type member
Specialize PolicyPropertyAdaptor for DesiredOccupancy and MaximizeOccupancy
Add overloads Experimental::prefer(Policy, OccupancyControl)
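For orientation, a heavily simplified sketch of the pattern these commits implement. Only DesiredOccupancy, MaximizeOccupancy, and prefer are names from this PR; the reduced types below are hypothetical, and the real PolicyPropertyAdaptor machinery is more involved.

// Hypothetical, heavily reduced sketch of carrying a runtime occupancy
// property along with an execution policy.
struct MaximizeOccupancy {};  // stateless default: keep normal deduction

struct DesiredOccupancy {
  int value;  // percent
  explicit DesiredOccupancy(int v) : value(v) {}
};

template <class Policy>
struct PolicyWithOccupancy : Policy {  // stand-in for the adapted policy type
  DesiredOccupancy occupancy{100};
  explicit PolicyWithOccupancy(Policy const& p) : Policy(p) {}
};

template <class Policy>
PolicyWithOccupancy<Policy> prefer(Policy const& p, DesiredOccupancy occ) {
  PolicyWithOccupancy<Policy> out{p};
  out.occupancy = occ;  // the runtime value travels to the kernel launch
  return out;
}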
@nliber (Contributor) left a comment

My minor comment on the C-style cast notwithstanding, this looks fine (although PolicyTraitsBase having nine template parameters is a bit intimidating, especially the conditional_t block, but that is the pattern already in use here).

@masterleinad (Contributor): The relevant MSVC error message is

C:\projects\source\core\src\impl/Kokkos_AnalyzePolicy.hpp(313,57): error C2039: 'occupancy_control': is not a member of 'Kokkos::Serial' [C:\projects\source\build\core\src\kokkoscore.vcxproj]
C:\projects\source\core\src\Kokkos_Serial.hpp(88): message : see declaration of 'Kokkos::Serial' [C:\projects\source\build\core\src\kokkoscore.vcxproj]
C:\projects\source\core\src\impl/Kokkos_AnalyzePolicy.hpp(394): message : see reference to class template instantiation 'Kokkos::Impl::PolicyDataStorage<Kokkos::DefaultHostExecutionSpace>' being compiled [C:\projects\source\build\core\src\kokkoscore.vcxproj]
C:\projects\source\core\src\Kokkos_ExecPolicy.hpp(91): message : see reference to class template instantiation 'Kokkos::Impl::PolicyTraits<Kokkos::DefaultHostExecutionSpace>' being compiled [C:\projects\source\build\core\src\kokkoscore.vcxproj]
C:\projects\source\core\src\impl\Kokkos_HostSpace_deepcopy.cpp(83): message : see reference to class template instantiation 'Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>' being compiled [C:\projects\source\build\core\src\kokkoscore.vcxproj]

@dhollman commented Nov 9, 2020

LGTM once it passes CI

@crtrott (Member) left a comment

Need to fix the "divide by zero" issue, and we should probably round to nearest, not down to the next lower value.

size_t const shmem_per_sm_prefer_l1 = get_shmem_per_sm_prefer_l1(properties);
size_t const static_shmem = attributes.sharedSizeBytes;
int active_blocks = properties.maxThreadsPerMultiProcessor / block_size *
                    desired_occupancy / 100;
Member: I think this is iffy. Say maxThreadsPerMultiProcessor is 2048, block size is 700, and desired_occupancy is 33: doesn't that give zero here (2048 / 700 = 2.9…, which truncates to 2; 2 * 33 = 66; 66 / 100 = 0), even though 700 / 2048 > 0.33? In either case, don't we need to turn active_blocks == 0 into active_blocks = 1 before dividing in the next line?

Member: And even if you use 670 as the block size (which puts you just below 0.33), you still end up with zero, since 3 * 33 is 99 and that still divides away to zero. This problem can be solved by just making the computation use double. We also should probably round to nearest rather than towards zero: if the computation says 1.9 blocks give you the desired occupancy, I think we should round to 2 instead of 1.
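A sketch of the suggested fix (a hypothetical helper, not the PR's final code): do the computation in double, round to nearest, and clamp to at least one block before it is used as a divisor.

#include <cmath>
#include <cuda_runtime.h>

// Sketch: compute the target number of resident blocks per SM in floating
// point, round to nearest, and never let it fall to zero.
inline int active_blocks_for(cudaDeviceProp const& properties, int block_size,
                             int desired_occupancy /* percent */) {
  double const max_blocks =
      static_cast<double>(properties.maxThreadsPerMultiProcessor) / block_size;
  int const rounded =
      static_cast<int>(std::round(max_blocks * desired_occupancy / 100.0));
  return rounded < 1 ? 1 : rounded;  // avoid a later divide by zero
}

With the example above, 2048 / 700 gives 2.93 blocks, times 0.33 gives 0.97, which rounds to 1 instead of truncating to 0.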

Member: In either case, the potential divide by zero needs to be fixed.

@dalg24 (Member Author): Fixed.

typename policy_in_t::traits::iteration_pattern,
typename policy_in_t::traits::launch_bounds,
typename policy_in_t::traits::work_item_property,
MaximizeOccupancy>;
Member: Shouldn't there be a graph tag in here?

Member: Ah, we don't need that since graph tags are not public and so shouldn't exist by the time prefer/require is called, right?


@crtrott (Member) commented Nov 10, 2020

I should add: other than the divide-by-zero issue and the rounding, I am good with this.

@dalg24 (Member Author) commented Nov 10, 2020

Retest this please

@aprokop (Collaborator) commented Nov 10, 2020

Can confirm that 3824a11 works as expected on ArborX's HACC halo-finding problem (see #3379 (comment)).

@crtrott (Member) left a comment

Looks good. Can we just restart that one test in Jenkins? It was a timing issue.

@dalg24 (Member Author) commented Nov 10, 2020

Looks good. Can we just restart that one test in Jenkins? It was a timing issue.

Build passed https://cloud.cees.ornl.gov/jenkins-ci/blue/organizations/jenkins/Kokkos/detail/Kokkos/3487/pipeline/

@crtrott merged commit 271da16 into kokkos:develop on Nov 10, 2020
@dalg24 (Member Author) commented Nov 10, 2020

🥳 Thanks @aprokop and @dhollman for helping with this

@dalg24 deleted the control_cuda_occupancy branch November 10, 2020 18:27