Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cuda improve heuristic for blocksize #4271

Merged
merged 2 commits into from
Aug 27, 2021
Merged

Conversation

crtrott
Copy link
Member

@crtrott crtrott commented Aug 27, 2021

This also updated the stream benchmark, necessary to demonstrate the benefit.

@crtrott crtrott added this to In progress in Kokkos Release 3.5 via automation Aug 27, 2021
@crtrott crtrott moved this from In progress to Awaiting Feedback in Kokkos Release 3.5 Aug 27, 2021
@crtrott crtrott added the Blocks Promotion Overview issue for release-blocking bugs label Aug 27, 2021
@masterleinad
Copy link
Contributor

What are the improvements you are getting when running the benchmark?

@dalg24
Copy link
Member

dalg24 commented Aug 27, 2021

What are the improvements you are getting when running the benchmark?

I was about to ask the same but see the commit message ea37ea3

benchmarks/stream/stream-kokkos.cpp Outdated Show resolved Hide resolved
benchmarks/stream/stream-kokkos.cpp Outdated Show resolved Hide resolved
@masterleinad
Copy link
Contributor

What are the improvements you are getting when running the benchmark?

I was about to ask the same but see the commit message ea37ea3

Some experiments deomnstrated that for certain kernels the
current heuristic isn't great. In particular copy and memset
kernels were bad.

Using the updated stream benchmark I got before this change:

Set               654385.49 MB/s
Copy              654385.49 MB/s
Scale             654398.87 MB/s
Add               846436.34 MB/s
Triad             844568.74 MB/s

With this change:

Set               806107.54 MB/s
Copy              805491.91 MB/s
Scale             807181.31 MB/s
Add               845471.17 MB/s
Triad             845531.05 MB/s

ExaminidMD also improved from 2.48e+08 to 2.82e+08:

1 256000 | 0.906401 0.480328 0.142917 0.165107 0.117937 | 1103.264687 2.824358e+08 2.824358e+08 PERFORMANCE

1 256000 | 1.030611 0.501819 0.243033 0.163163 0.122484 | 970.297956 2.483963e+08 2.483963e+08 PERFORMANCE

@DavidPoliakoff
Copy link
Contributor

@crtrott: that matches the best autotuning numbers I can get

@crtrott
Copy link
Member Author

crtrott commented Aug 27, 2021

yeah but it was wrong :-) (Daniel noted that)

 Set               327316.30 MB/s
    Copy              654344.27 MB/s
    Scale             654263.20 MB/s
    Add               846497.84 MB/s
    Triad             844604.40 MB/s

    With this change:

    Set               652713.29 MB/s
    Copy              807649.65 MB/s
    Scale             808014.29 MB/s
    Add               847403.47 MB/s
    Triad             845885.63 MB/s

This is the real number. To get the 807 with set you need a block size of 256, but that has more detremential impact for more complex kernels. So I thought we go with 128, which only leaves kernels which do a single memory op per thread of by 25%.

@DavidPoliakoff
Copy link
Contributor

DavidPoliakoff commented Aug 27, 2021 via email

Some experiments deomnstrated that for certain kernels the
current heuristic isn't great. In particular copy and memset
kernels were bad.

Using the updated stream benchmark I got before this change:

Set               327316.30 MB/s
Copy              654344.27 MB/s
Scale             654263.20 MB/s
Add               846497.84 MB/s
Triad             844604.40 MB/s

With this change:

Set               652713.29 MB/s
Copy              807649.65 MB/s
Scale             808014.29 MB/s
Add               847403.47 MB/s
Triad             845885.63 MB/s

ExaminidMD also improved from 2.48e+08 to 2.82e+08:

1 256000 | 0.906401 0.480328 0.142917 0.165107 0.117937 | 1103.264687 2.824358e+08 2.824358e+08 PERFORMANCE

1 256000 | 1.030611 0.501819 0.243033 0.163163 0.122484 | 970.297956 2.483963e+08 2.483963e+08 PERFORMANCE
Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK to me.

Copy link
Contributor

@DavidPoliakoff DavidPoliakoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should make a note in the release about this, in case some people have a bad reaction

@crtrott
Copy link
Member Author

crtrott commented Aug 27, 2021

I also test 256 with ExaMIniMD and its slower than 128:

Here for 3 different sizes (20^3, 30^3 and 40^3, i.e. 32k atoms, ~100k atoms and 256k atoms)

    64   128  256
20  1.04 1.05 1.03
30  1.94 2.14 2.01
40  2.49 2.84 2.70

@dalg24 dalg24 merged commit 52d1c93 into kokkos:develop Aug 27, 2021
Kokkos Release 3.5 automation moved this from Awaiting Feedback to Done Aug 27, 2021
@crtrott crtrott deleted the cuda-fix-heuristic branch August 27, 2021 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocks Promotion Overview issue for release-blocking bugs
Projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants