Experimental feature: control cuda occupancy #3379
Conversation
I would consider moving the occupancy limitation all the way down to the KernelLaunch and leaving the ParallelFoo functions etc. alone. That would orthogonalize the block size / team_size choice from the occupancy limitation, and it would work for all policies at once instead of doing something different for every policy. Essentially, just add an additional shared memory request at kernel launch. The current approach has the additional problem that the KernelLaunch now thinks you need shared memory and thus will prefer shared memory over L1, which would be exactly the wrong thing to do for RangePolicy.
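The suggested mechanism can be sketched in a few lines of plain C++ (the function name and parameters are illustrative, not actual Kokkos names): by requesting enough dynamic shared memory per block, at most the desired number of blocks can be resident per multiprocessor.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: limit occupancy at kernel launch by requesting extra
// dynamic shared memory. Requesting ceil(shmem_per_sm / active_blocks) bytes
// per block guarantees that at most active_blocks blocks fit on one SM.
// (A real implementation would also account for the shared memory
// allocation granularity of the device.)
inline std::size_t shmem_request_to_limit_occupancy(std::size_t shmem_per_sm,
                                                    std::size_t active_blocks) {
  return (shmem_per_sm + active_blocks - 1) / active_blocks;
}
```

For example, with 64 KiB of shared memory per SM, requesting 32 KiB per block caps residency at two blocks per SM.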
I think we need a different implementation strategy (see my other comment).
Force-pushed 3df9ae1 to 2f7f17b
Any thoughts on testing running parallel_[for/reduce/scan] with these policies, rather than just construction?
Force-pushed 9c8a531 to 223877f
Is there a way to use one of the cudaOccupancy functions to check that our launch parameters make sense?
@@ -198,6 +198,31 @@ int cuda_get_opt_block_size(const CudaInternal* cuda_instance,
                            LaunchBounds{});
}

// Assuming cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1)
inline size_t get_shmem_per_sm_prefer_l1(cudaDeviceProp const& properties) {
There is no way to compute this?
Not that I know of
Actually, there is a cudaOccSMemPerMultiprocessor() function in the "cuda_occupancy.h" header, but technically it belongs to implementation details and we are not really supposed to use it.
Turns out the function does not do what we want for Volta+: it does not return the smallest possible configuration, which is zero...
Force-pushed 693c777 to afab561
No, https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OCCUPANCY.html
Why did you call the analyzed policy property "controls block size deduction"? I mean, you are making a runtime property value called "desired occupancy" depend on that thing. Why not name it that as well?
Otherwise, I am good with the technical implementation.
Because I was thinking about extensibility.
Force-pushed 166c8ad to e38bec1
Force-pushed cc8eb5b to 451b705
std::enable_if_t<Enable> experimental_set_desired_occupancy(
    Experimental::DesiredOccupancy desired_occupancy) {
  this->m_occupancy = {desired_occupancy};
auto experimental_get_desired_occupancy() const {
I actually meant to SFINAE away this setter and the getter below when experimental_contains_desired_occupancy is false.
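This suggestion can be illustrated with a minimal sketch (the class is hypothetical; names mirror the snippet above): both the setter and the getter are gated on a compile-time flag, so neither member exists when the policy does not carry a desired occupancy.

```cpp
#include <cassert>
#include <type_traits>

struct DesiredOccupancy {
  int value;
};

// Hypothetical policy sketch: HasDesiredOccupancy stands in for the
// experimental_contains_desired_occupancy trait. The setter and getter are
// SFINAE'd away when the flag is false, so calling them on such a policy is
// a compile-time error rather than a silent no-op.
template <bool HasDesiredOccupancy>
struct PolicySketch {
  DesiredOccupancy m_occupancy{100};

  template <bool Enable = HasDesiredOccupancy>
  std::enable_if_t<Enable> experimental_set_desired_occupancy(
      DesiredOccupancy desired_occupancy) {
    m_occupancy = desired_occupancy;
  }

  template <bool Enable = HasDesiredOccupancy>
  std::enable_if_t<Enable, DesiredOccupancy>
  experimental_get_desired_occupancy() const {
    return m_occupancy;
  }
};
```

With `PolicySketch<false>`, any attempt to call either member fails template substitution, which is the behavior requested in the review.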
  (void)cache_config_preference_cached;
  if (cache_config_preference_cached != prefer_shmem) {
    CUDA_SAFE_CALL(cudaFuncSetCacheConfig(
        func,
        (prefer_shmem ? cudaFuncCachePreferShared : cudaFuncCachePreferL1)));
    cache_config_preference_cached = prefer_shmem;
  }
This is being addressed in #3560.
Force-pushed 49b12fb to d0224b2
Retest this please
Force-pushed d9e97a3 to f64b884
Added PolicyTraits::occupancy_control type member
…tion of Policy::occupancy_control
Specialize PolicyPropertyAdaptor for DesiredOccupancy and MaximizeOccupancy. Add overloads Experimental::prefer(Policy, OccupancyControl).
Force-pushed f64b884 to 7325df4
Minor comment on the C-style cast notwithstanding, this looks fine (although PolicyTraitsBase having nine template parameters is a bit intimidating, especially the conditional_t block, but that is the pattern already in use here).
The relevant MSVC error message is
LGTM once it passes CI
Force-pushed 4fdd237 to f0f5261
We need to fix the "divide by zero" issue, and we should probably round to nearest, not to the next lower integer.
size_t const shmem_per_sm_prefer_l1 = get_shmem_per_sm_prefer_l1(properties);
size_t const static_shmem = attributes.sharedSizeBytes;
int active_blocks = properties.maxThreadsPerMultiProcessor / block_size *
                    desired_occupancy / 100;
I think this is iffy. Say maxThreadsPerMultiProcessor is 2048, the block size is 700, and desired_occupancy is 33: doesn't that give zero here (2048 / 700 = 2.9... = 2, 2 * 33 = 66, 66 / 100 = 0), even though 700 / 2048 > 0.33?
In either case, don't we need to turn active_blocks == 0 into active_blocks = 1 before dividing in the next line?
And even if you use 670 as the block size (which puts you just below 0.33), you still end up with zero, since 3 * 33 = 99 and that still divides away to zero. This problem can be solved by doing the computation in double. We also should round to nearest rather than toward zero: if the computation says 1.9 blocks give the desired occupancy, I think we should round to 2 instead of 1.
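The arithmetic issue can be reproduced in a few lines of plain C++ (a sketch, not the actual Kokkos code): the integer version truncates to zero for the cases above, while computing in double, rounding to nearest, and clamping to at least one block behaves as suggested.

```cpp
#include <algorithm>
#include <cmath>

// Integer version, mirroring the snippet under discussion: truncates at
// every step, so 2048 / 700 = 2, then 2 * 33 = 66, then 66 / 100 = 0
// active blocks, which then divides by zero downstream.
inline int active_blocks_int(int max_threads_per_sm, int block_size,
                             int desired_occupancy) {
  return max_threads_per_sm / block_size * desired_occupancy / 100;
}

// Suggested fix (sketch): compute in double, round to nearest, and clamp to
// at least one block so the subsequent division cannot be by zero.
inline int active_blocks_double(int max_threads_per_sm, int block_size,
                                int desired_occupancy) {
  double const blocks = static_cast<double>(max_threads_per_sm) / block_size *
                        desired_occupancy / 100.0;
  return std::max(1, static_cast<int>(std::lround(blocks)));
}
```

Note that the double version also rounds 1.9 computed blocks up to 2, rather than truncating to 1.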
In either case the potential divide by zero needs to be fixed.
Fixed
typename policy_in_t::traits::iteration_pattern,
typename policy_in_t::traits::launch_bounds,
typename policy_in_t::traits::work_item_property,
MaximizeOccupancy>;
Shouldn't there be a graph tag in here?
Ah, we don't need that, since graph tags are not public and so shouldn't exist by the time prefer/require is called, right?
I should add: other than the divide-by-zero issue and the rounding, I am good with this.
…when computing the number of active blocks in modify_launch_configuration_if_desired_occupancy_is_specified
Co-Authored-By: Christian Trott <crtrott@sandia.gov>
Force-pushed f0f5261 to 3824a11
Retest this please
Can confirm that 3824a11 works as expected on ArborX's HACC halo finding problem (see #3379 (comment)).
Looks good. Can we just restart that one test in Jenkins? It was a timing issue.
Build passed https://cloud.cees.ornl.gov/jenkins-ci/blue/organizations/jenkins/Kokkos/detail/Kokkos/3487/pipeline/
The intent is that passing `prefer(policy, DesiredOccupancy(33))` to a `parallel_for()`, `parallel_reduce()`, or `parallel_scan()` will bypass the block size deduction that tries to maximize occupancy and instead adjust the launch parameters (by fixing the block size and requesting shared memory) to achieve the specified occupancy. The desired occupancy is in percent.

I have only implemented it for `ParallelFor<RangePolicy>` and I am looking for feedback. I am not sure how to test the feature. I have tried it in ArborX on the main tree traversal kernel for spatial predicates.

Update: Instead of "bypassing" the normal block size deduction, we decided to modify the launch configuration just before the kernel launch.