Caching cudaFunctorAttributes and whether L1/Shmem prefer was set #3151

Merged
merged 1 commit into from
Jul 3, 2020

Conversation

crtrott
Member

@crtrott crtrott commented Jul 3, 2020

This reduces launch latencies, and closes the gap between a raw CUDA
launch and Kokkos Launch to about 0.2us (empty Kernel parallel_for).

Test case comparison: for each variant the two times are non-fenced and fenced after each kernel call. The values in parentheses add an extra fence after the Repeats loop (which hardly matters, since I can't enqueue that many kernels anyway, so it only adds about 5us per Repeats).

Old (all runs: LoopCount = 10000, Repeats = 10000; first time non-fenced, second fenced, parentheses as described above):

FunctorCArraySize=1, InnerLoopCount=1
  raw:                   3.360421 (3.360927)
  parallel_for:          4.337152 / 9.894000   (4.337722 / 9.894113)
  parallel_reduce:       14.843851 / 15.885946 (14.844018 / 15.886035)
  parallel_reduce(view): 5.188312 / 12.267948  (5.779884 / 12.268049)

FunctorCArraySize=16, InnerLoopCount=10
  raw:                   3.291513 (3.292117)
  parallel_for:          4.385937 / 9.595486   (4.386451 / 9.595609)
  parallel_reduce:       14.408576 / 15.886102 (14.408739 / 15.886200)
  parallel_reduce(view): 5.352994 / 12.819591  (5.967374 / 12.819692)

FunctorCArraySize=200, InnerLoopCount=10
  raw:                   3.517541 (3.518094)
  parallel_for:          10.905219 / 12.708534 (10.907358 / 12.708635)
  parallel_reduce:       18.430118 / 19.563913 (18.430311 / 19.564008)
  parallel_reduce(view): 14.289269 / 15.626251 (14.290132 / 15.626348)

FunctorCArraySize=3000, InnerLoopCount=10
  raw:                   3.544100 (3.544642)
  parallel_for:          13.466383 / 15.665507 (13.466974 / 15.665618)
  parallel_reduce:       22.444786 / 23.621668 (22.444963 / 23.621766)
  parallel_reduce(view): 16.727474 / 20.115662 (16.728418 / 20.115771)

New (same configurations):

FunctorCArraySize=1, InnerLoopCount=1
  raw:                   3.388423 (3.388938)
  parallel_for:          3.611583 / 9.004801   (3.612182 / 9.004904)
  parallel_reduce:       13.554614 / 15.010310 (13.554782 / 15.010404)
  parallel_reduce(view): 5.208067 / 11.574165  (5.807058 / 11.574267)

FunctorCArraySize=16, InnerLoopCount=10
  raw:                   3.320113 (3.320646)
  parallel_for:          3.703650 / 8.824116   (3.704197 / 8.824227)
  parallel_reduce:       13.593689 / 15.010715 (13.593853 / 15.010810)
  parallel_reduce(view): 5.392943 / 12.073576  (6.007791 / 12.073711)

FunctorCArraySize=200, InnerLoopCount=10
  raw:                   3.537305 (3.537863)
  parallel_for:          10.890026 / 12.105710 (10.890533 / 12.105827)
  parallel_reduce:       17.602991 / 18.618286 (17.603149 / 18.618372)
  parallel_reduce(view): 14.308791 / 14.845797 (14.309610 / 14.845891)

FunctorCArraySize=3000, InnerLoopCount=10
  raw:                   3.538977 (3.539568)
  parallel_for:          13.337157 / 14.805056 (13.337811 / 14.805167)
  parallel_reduce:       21.849866 / 22.893569 (21.850049 / 22.893656)
  parallel_reduce(view): 16.641581 / 19.349426 (16.642512 / 19.349533)

@crtrott crtrott merged commit 22d3757 into kokkos:develop Jul 3, 2020
@crtrott crtrott deleted the cuda-kernel-launch-improve branch December 19, 2022 17:17