cuda_internal_maximum_warp_count returns 8, but I believe it should return 16 for P100 #1269

ambrad · 2017-12-11T19:24:03Z

I believe the maximum number of threads in a thread block on P100 is 32*16 = 512. I thought that cuda_internal_maximum_warp_count would return 512/Impl::CudaTraits::WarpSize = 512/32 = 16, but it returns 8. Is this a bug, or am I misinterpreting the purpose of the function? Thanks.
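The arithmetic in the question can be written out as a small sketch. `max_warps_per_block` is a hypothetical helper for illustration, not a Kokkos function, and the 512-thread figure is the poster's assumption; in real code the limits would be read from `cudaDeviceProp::maxThreadsPerBlock` and `cudaDeviceProp::warpSize` rather than hard-coded.

```cpp
#include <cassert>

// Hypothetical helper (not part of Kokkos): number of whole warps that fit
// in one thread block, given the block-size limit and the warp size.
int max_warps_per_block(int max_threads_per_block, int warp_size) {
  assert(warp_size > 0 && max_threads_per_block >= 0);
  // Integer division drops any partial warp.
  return max_threads_per_block / warp_size;
}
```

With the poster's figures, `max_warps_per_block(512, 32)` gives 16. The reported value of 8 is what the same formula gives for a 256-thread limit.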

crtrott · 2017-12-22T00:44:59Z

Let me check this.

crtrott · 2018-01-04T18:49:48Z

Wow yeah, there is a comment from early 2012 on this. I think this is outdated, and also probably not used anywhere ...

crtrott · 2018-01-28T04:25:05Z

This is now fixed. I also changed the behavior of the normal RangePolicy to use a better heuristic for what the block size should be. Furthermore the old heuristic did NOT take register utilization into account, and could thus fail for very complex kernels. We were just lucky that most really complex kernels were already using Hierarchical Parallelism, which was using the better heuristics already.
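To illustrate why ignoring register usage can fail for complex kernels, here is a minimal sketch of a register-aware block-size choice. It is not the heuristic Kokkos uses; `register_aware_block_size` and all of its parameters are hypothetical. In a real CUDA build the inputs would likely come from `cudaFuncAttributes::numRegs` (via `cudaFuncGetAttributes`) and from `cudaDeviceProp::regsPerBlock` and `cudaDeviceProp::maxThreadsPerBlock`.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sketch, NOT the Kokkos implementation: pick the largest block
// size that is a multiple of the warp size and whose total register demand
// still fits in the per-block register budget.
int register_aware_block_size(int regs_per_thread, int regs_per_block,
                              int max_threads_per_block, int warp_size) {
  assert(regs_per_thread > 0 && warp_size > 0);
  // Threads that fit when only registers are considered.
  int by_registers = regs_per_block / regs_per_thread;
  // The hardware block-size limit still applies.
  int limit = std::min(by_registers, max_threads_per_block);
  // Round down to a whole number of warps; 0 means not even one warp fits.
  return (limit / warp_size) * warp_size;
}
```

A fixed block size that ignores `regs_per_thread` can exceed this bound for a register-heavy kernel, in which case the launch fails. For example, assuming a budget of 65536 registers per block, a kernel using 255 registers per thread fits at most 256 threads.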

…e-1206

* 'issue-1206' of github.com:ndellingwood/kokkos:
  Issue kokkos#1206 - fix order of args to DynamicView in test_sort
  Issue kokkos#1206: Fix DynamicView API in test_sort in algorithms
  DynamicView: Address issue kokkos#1206
  Attempt to get rid of warning
  Fix issue in deep_copy changes
  Fix an issue with the benchmark suite after changes in macros
  Fix warning with CUDA for OpenMP nthreads unused variable
  Fix issue kokkos#1269
  Fix deep_copy between empty views issue kokkos#1369
  Adding OpenMP InterOp test issue kokkos#1305
  Fix CUDA interoperability and add unit test
  Fix issue kokkos#1363 : Deepcopy between rank-1 views with LayoutLeft/Right
  Adding ChunkSize constructor overload to RangePolicy.
  Error out when -arch not detected

ibaned added the Enhancement label (Improve existing capability; will potentially require voting) Dec 12, 2017

ibaned assigned crtrott Dec 12, 2017

ibaned added this to the 2018 February milestone Dec 12, 2017

ambrad mentioned this issue Jan 23, 2018

Kokkos --with-cuda --debug reports "too large team size" E3SM-Project/HOMMEXX#148

Closed

2 tasks

crtrott added a commit that referenced this issue Jan 27, 2018

Fix issue #1269

788132b

Also make range parallel for use the "find optimal block size" function.
Improves miniMD benchmark suite runs by 5 and 9% respectively at the cost of 1% on miniFE.

ibaned mentioned this issue Jan 27, 2018

Issues 1370 1369 1269 #1371

Merged

crtrott added the InDevelop label Jan 28, 2018

ndellingwood closed this as completed Mar 7, 2018

ndellingwood mentioned this issue Mar 8, 2018

Kokkos + KokkosKernels Promotion To Version 2.6.00 trilinos/Trilinos#2351

Merged
