Cuda improve heuristic for blocksize #4271
What are the improvements you are getting when running the benchmark?
I was about to ask the same but see the commit message ea37ea3
ea37ea3 to 1c763e1
@crtrott: that matches the best autotuning numbers I can get
yeah but it was wrong :-) (Daniel noted that)
This is the real number. To get the 807 with set you need a block size of 256, but that has a more detrimental impact on more complex kernels. So I thought we go with 128, which only leaves kernels which do a single memory op per thread off by 25%.
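To make "a single memory op per thread" concrete, here is a minimal CPU-side sketch of how a set kernel maps onto blocks for a given block size. It is not the Kokkos heuristic and not real CUDA code: `blocks_for` and `emulate_set` are hypothetical helpers that only emulate the launch geometry, so the loop order says nothing about GPU performance.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Kokkos or CUDA API): number of blocks needed to
// cover n elements when every thread handles exactly one element.
std::size_t blocks_for(std::size_t n, std::size_t block_size) {
  return (n + block_size - 1) / block_size;  // ceiling division
}

// CPU emulation of a "set" kernel launch. Each (block, thread) pair stands in
// for one GPU thread and performs a single store, which is the
// "one memory op per thread" case discussed above.
void emulate_set(std::vector<double>& a, double value, std::size_t block_size) {
  const std::size_t n = a.size();
  const std::size_t grid = blocks_for(n, block_size);
  for (std::size_t block = 0; block < grid; ++block) {
    for (std::size_t thread = 0; thread < block_size; ++thread) {
      const std::size_t i = block * block_size + thread;
      if (i < n) {      // guard the partially filled last block
        a[i] = value;   // the only memory operation this "thread" does
      }
    }
  }
}
```

With a block size of 128 the same array needs twice as many blocks as with 256, each doing half the work; for a kernel this light, the per-block work is what the block size choice changes.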
Oh, my stream doesn't have "set", just the other 4
…On Fri, Aug 27, 2021 at 11:09 AM Christian Trott ***@***.***> wrote:
yeah but it was wrong :-)
Set 327316.30 MB/s
Copy 654344.27 MB/s
Scale 654263.20 MB/s
Add 846497.84 MB/s
Triad 844604.40 MB/s
With this change:
Set 652713.29 MB/s
Copy 807649.65 MB/s
Scale 808014.29 MB/s
Add 847403.47 MB/s
Triad 845885.63 MB/s
This is the real number. To get the 807 with set you need a block size of
256, but that has a more detrimental impact on more complex kernels. So I
thought we go with 128, which only leaves kernels which do a single memory
op per thread off by 25%.
--
Thanks,
David
Some experiments demonstrated that for certain kernels the current heuristic isn't great. In particular, copy and memset kernels were bad. Using the updated stream benchmark, I got before this change:
Set 327316.30 MB/s
Copy 654344.27 MB/s
Scale 654263.20 MB/s
Add 846497.84 MB/s
Triad 844604.40 MB/s
With this change:
Set 652713.29 MB/s
Copy 807649.65 MB/s
Scale 808014.29 MB/s
Add 847403.47 MB/s
Triad 845885.63 MB/s
ExaMiniMD also improved from 2.48e+08 to 2.82e+08 (first line with this change, second line before):
1 256000 | 0.906401 0.480328 0.142917 0.165107 0.117937 | 1103.264687 2.824358e+08 2.824358e+08 PERFORMANCE
1 256000 | 1.030611 0.501819 0.243033 0.163163 0.122484 | 970.297956 2.483963e+08 2.483963e+08 PERFORMANCE
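For a quick read of the figures above, the following sketch computes the ratio of the new number to the old one for each kernel. The values are copied from this description; the `Result` struct and the `speedup` and `print_speedups` helpers are illustrative names, not part of Kokkos or the stream benchmark.

```cpp
#include <cstdio>

// One benchmark line: the figure before and after this change.
struct Result {
  const char* name;
  double before;
  double after;
};

// Stream bandwidths in MB/s, copied from the numbers quoted above.
const Result kStream[] = {
    {"Set", 327316.30, 652713.29},
    {"Copy", 654344.27, 807649.65},
    {"Scale", 654263.20, 808014.29},
    {"Add", 846497.84, 847403.47},
    {"Triad", 844604.40, 845885.63},
};

// ExaMiniMD performance figure from the two PERFORMANCE lines above.
const Result kExaMiniMD = {"ExaMiniMD", 2.483963e+08, 2.824358e+08};

// Ratio of the new figure to the old one; values above 1 are improvements.
double speedup(double before, double after) { return after / before; }

// Print one line per kernel, followed by the ExaMiniMD ratio.
void print_speedups() {
  for (const Result& r : kStream) {
    std::printf("%-9s %.3fx\n", r.name, speedup(r.before, r.after));
  }
  std::printf("%-9s %.3fx\n", kExaMiniMD.name,
              speedup(kExaMiniMD.before, kExaMiniMD.after));
}
```

Calling `print_speedups()` shows Set roughly doubling, Copy and Scale gaining a bit over 20%, and Add and Triad essentially unchanged, which matches the observation that the old heuristic mainly hurt kernels with few memory operations per thread.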
1c763e1 to 501f056
Looks OK to me.
Should make a note in the release about this, in case some people have a bad reaction
I also tested 256 with ExaMiniMD and it's slower than 128. Here for 3 different sizes (20^3, 30^3 and 40^3, i.e. 32k atoms, ~100k atoms and 256k atoms)
This also updates the stream benchmark, which was necessary to demonstrate the benefit.