-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework CUDA cache config using carveout calculations #5706
Conversation
1743a85
to
937fa4f
Compare
937fa4f
to
7de2b62
Compare
Tagged as blocking in lieu of #4295 |
0dbd2fa
to
b387701
Compare
Make sure you add a description |
b387701
to
119d29d
Compare
Retest this please. |
inline void configure_shmem_preference(const KernelFuncPtr& func, | ||
const cudaDeviceProp& device_props, | ||
const size_t block_size, int& shmem, | ||
const size_t occupancy) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish you did not interleave out and in parameters...
size_t num_blocks_desired = | ||
(num_threads_desired + block_size * 0.8) / block_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid "magic" numbers. Give it a name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null_point_eight
?
This will generally rely on the CUDA runtime to figure out best cache configuration, without us setting it manually anymore. We only set it manually if and when someone requests restricted occupancy. However, I did test the carveout scheme when always manually setting, and it did basically replicate the performance obtained by not setting anything for LAMMPS - and beats what we had before which was more limited. I also wrote a test code to check that the occupancy limitation works more or less as expected. Note that this is fairly iffy since there are many moving parts.
We may consider adding this test, however it requires that the GPU is used exclusively.