Cuda improve heuristic for blocksize #4271

crtrott · 2021-08-27T01:58:24Z

This also updates the stream benchmark, which was necessary to demonstrate the benefit.

benchmarks/stream/stream-kokkos.cpp

masterleinad · 2021-08-27T15:24:20Z

What are the improvements you are getting when running the benchmark?

dalg24 · 2021-08-27T16:13:40Z

> What are the improvements you are getting when running the benchmark?

I was about to ask the same, but see the commit message of ea37ea3.

core/src/Cuda/Kokkos_Cuda_BlockSize_Deduction.hpp

benchmarks/stream/stream-kokkos.cpp

masterleinad · 2021-08-27T16:33:28Z

> > What are the improvements you are getting when running the benchmark?
>
> I was about to ask the same but see the commit message ea37ea3

Some experiments demonstrated that for certain kernels the
current heuristic isn't great. In particular, copy and memset
kernels were bad.

Using the updated stream benchmark I got before this change:

Set               654385.49 MB/s
Copy              654385.49 MB/s
Scale             654398.87 MB/s
Add               846436.34 MB/s
Triad             844568.74 MB/s

With this change:

Set               806107.54 MB/s
Copy              805491.91 MB/s
Scale             807181.31 MB/s
Add               845471.17 MB/s
Triad             845531.05 MB/s

ExaMiniMD also improved from 2.48e+08 to 2.82e+08:

1 256000 | 0.906401 0.480328 0.142917 0.165107 0.117937 | 1103.264687 2.824358e+08 2.824358e+08 PERFORMANCE

1 256000 | 1.030611 0.501819 0.243033 0.163163 0.122484 | 970.297956 2.483963e+08 2.483963e+08 PERFORMANCE

DavidPoliakoff · 2021-08-27T16:36:01Z

@crtrott: that matches the best autotuning numbers I can get

crtrott · 2021-08-27T18:09:47Z

yeah but it was wrong :-) (Daniel noted that)

Set               327316.30 MB/s
Copy              654344.27 MB/s
Scale             654263.20 MB/s
Add               846497.84 MB/s
Triad             844604.40 MB/s

With this change:

Set               652713.29 MB/s
Copy              807649.65 MB/s
Scale             808014.29 MB/s
Add               847403.47 MB/s
Triad             845885.63 MB/s

These are the real numbers. To get the 807 with Set you need a block size of 256, but that has a more detrimental impact on more complex kernels. So I thought we would go with 128, which only leaves kernels that do a single memory op per thread off by about 25%.
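
As a rough check of the figures in this comment, here is a minimal sketch in plain C++ (not code from this PR; the helper names `relative_gain` and `print_stream_gains` are made up for illustration). It computes the relative bandwidth change for Set and Copy from the corrected tables, and how far Set at block size 128 falls short of block size 256, assuming "the 807" means roughly 807000 MB/s.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper, not part of Kokkos: relative change from `before`
// to `after`, so 0.25 means "25% higher".
double relative_gain(double before, double after) {
  assert(before > 0.0);
  return after / before - 1.0;
}

// Prints the gains implied by the corrected stream numbers (all in MB/s).
void print_stream_gains() {
  const double set_before = 327316.30, set_after = 652713.29;
  const double copy_before = 654344.27, copy_after = 807649.65;
  // Assumption: Set at block size 256 reaches about 807000 MB/s.
  const double set_at_256 = 807000.0;

  std::printf("Set  gain with this change:   %.1f%%\n",
              100.0 * relative_gain(set_before, set_after));
  std::printf("Copy gain with this change:   %.1f%%\n",
              100.0 * relative_gain(copy_before, copy_after));
  std::printf("Set at 256 relative to 128:   %.1f%%\n",
              100.0 * relative_gain(set_after, set_at_256));
}
```

Calling `print_stream_gains()` from a `main` prints the three percentages. Under the stated assumption, the Set shortfall at block size 128 comes out a little under 25%, in line with the estimate above.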

DavidPoliakoff · 2021-08-27T18:17:23Z

Oh, my stream doesn't have "set", just the other 4

-- Thanks, David

Some experiments demonstrated that for certain kernels the current heuristic isn't great. In particular, copy and memset kernels were bad.

Using the updated stream benchmark I got before this change:

Set               327316.30 MB/s
Copy              654344.27 MB/s
Scale             654263.20 MB/s
Add               846497.84 MB/s
Triad             844604.40 MB/s

With this change:

Set               652713.29 MB/s
Copy              807649.65 MB/s
Scale             808014.29 MB/s
Add               847403.47 MB/s
Triad             845885.63 MB/s

ExaMiniMD also improved from 2.48e+08 to 2.82e+08:

1 256000 | 0.906401 0.480328 0.142917 0.165107 0.117937 | 1103.264687 2.824358e+08 2.824358e+08 PERFORMANCE

1 256000 | 1.030611 0.501819 0.243033 0.163163 0.122484 | 970.297956 2.483963e+08 2.483963e+08 PERFORMANCE

masterleinad

Looks OK to me.

DavidPoliakoff

We should make a note about this in the release notes, in case some people have a bad reaction.

crtrott · 2021-08-27T20:22:34Z

I also tested 256 with ExaMiniMD and it's slower than 128.

Here are the results for 3 different sizes (20^3, 30^3 and 40^3, i.e. 32k atoms, ~100k atoms and 256k atoms), with one column per block size:

size  64    128   256
20    1.04  1.05  1.03
30    1.94  2.14  2.01
40    2.49  2.84  2.70
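
To make the comparison explicit, a small sketch in plain C++ (not from the PR; `best_block_size` is a hypothetical helper) that picks the best-performing block size for each row of the table. It assumes higher values are better, which matches 256 being described as slower than 128.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Block sizes tested in the table above, in column order.
const std::vector<int> kBlockSizes = {64, 128, 256};

// Hypothetical helper: returns the block size whose entry in `row` is
// largest, assuming higher values mean better performance.
int best_block_size(const std::vector<double>& row) {
  assert(row.size() == kBlockSizes.size());
  std::size_t best = 0;
  for (std::size_t i = 1; i < row.size(); ++i) {
    if (row[i] > row[best]) best = i;
  }
  return kBlockSizes[best];
}
```

For example, `best_block_size({2.49, 2.84, 2.70})` evaluates the 40^3 row. For all three rows above, the largest entry is in the 128 column.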

crtrott added this to In progress in Kokkos Release 3.5 via automation Aug 27, 2021

crtrott moved this from In progress to Awaiting Feedback in Kokkos Release 3.5 Aug 27, 2021

crtrott added the "Blocks Promotion" label (overview issue for release-blocking bugs) Aug 27, 2021

masterleinad reviewed Aug 27, 2021

dalg24 requested changes Aug 27, 2021

Update Stream Benchmark

a47a9a7

crtrott force-pushed the cuda-fix-heuristic branch from ea37ea3 to 1c763e1 Compare August 27, 2021 16:34

crtrott force-pushed the cuda-fix-heuristic branch from 1c763e1 to 501f056 Compare August 27, 2021 18:23

masterleinad approved these changes Aug 27, 2021

DavidPoliakoff approved these changes Aug 27, 2021

dalg24 approved these changes Aug 27, 2021

dalg24 merged commit 52d1c93 into kokkos:develop Aug 27, 2021

Kokkos Release 3.5 automation moved this from Awaiting Feedback to Done Aug 27, 2021

crtrott deleted the cuda-fix-heuristic branch August 27, 2021 21:25

ndellingwood mentioned this pull request Aug 30, 2021

Nightly test failures: cuda.batched_scalar_teamvector_*_double in clang/8+cuda/10.0 builds kokkos/kokkos-kernels#1089

Closed
