cuda_internal_maximum_warp_count returns 8, but I believe it should return 16 for P100 #1269

Closed
ambrad opened this issue Dec 11, 2017 · 3 comments
Labels
Enhancement Improve existing capability; will potentially require voting

Comments


ambrad commented Dec 11, 2017

I believe the maximum number of threads in a thread block on P100 is 32*16 = 512. I thought that cuda_internal_maximum_warp_count would return 512/Impl::CudaTraits::WarpSize = 512/32 = 16, but it returns 8. Is this a bug, or am I misinterpreting the purpose of the function? Thanks.
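The arithmetic in the question can be sketched as follows. This is only an illustration, not the Kokkos implementation: `max_warps_per_block` and `kWarpSize` are hypothetical names, and the 512-thread figure is the value assumed in the comment above. In real code the limit would more safely be read from `cudaDeviceProp::maxThreadsPerBlock` at run time than hard-coded.

```cpp
#include <cassert>

// Warp width assumed here; stands in for Impl::CudaTraits::WarpSize.
constexpr int kWarpSize = 32;

// Hypothetical helper: number of whole warps that fit in a thread block
// with the given maximum thread count.
inline int max_warps_per_block(int max_threads_per_block) {
  assert(max_threads_per_block > 0);
  return max_threads_per_block / kWarpSize;
}
```

With the 512 threads assumed in the question this gives 16 warps; a return value of 8 corresponds to a 256-thread limit.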

@ibaned ibaned added the Enhancement Improve existing capability; will potentially require voting label Dec 12, 2017
@ibaned ibaned added this to the 2018 February milestone Dec 12, 2017
Member

crtrott commented Dec 22, 2017

Let me check this.

Member

crtrott commented Jan 4, 2018

Wow, yeah, there is a comment from early 2012 on this. I think this is outdated, and also probably not used anywhere ...

crtrott added a commit that referenced this issue Jan 27, 2018
Also make range parallel for use the "find optimal block size" function
Improves miniMD benchmark suite runs by 5 and 9% respectively at the cost
of 1% on miniFE.
Member

crtrott commented Jan 28, 2018

This is now fixed. I also changed the behavior of the normal RangePolicy to use a better heuristic for choosing the block size. Furthermore, the old heuristic did NOT take register utilization into account, and could thus fail for very complex kernels. We were just lucky that most really complex kernels were already using Hierarchical Parallelism, which already used the better heuristic.
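To illustrate why register utilization matters for the block size, here is a minimal sketch of a register-aware cap. It is an assumption-laden illustration, not the heuristic that was actually committed to Kokkos: the function name and parameters are hypothetical. In CUDA code the inputs would typically come from `cudaFuncGetAttributes` (the `numRegs` field) and from `cudaDeviceProp` (`regsPerBlock`, `maxThreadsPerBlock`).

```cpp
#include <algorithm>
#include <cassert>

constexpr int kWarpSize = 32;  // assumed warp width

// Hypothetical helper: largest block size (a multiple of the warp size)
// whose total register demand fits within the per-block register limit.
// Returns 0 if not even one full warp fits.
inline int register_limited_block_size(int regs_per_thread,
                                       int regs_per_block,
                                       int max_threads_per_block) {
  assert(regs_per_thread > 0 && regs_per_block > 0 &&
         max_threads_per_block > 0);
  // Threads that fit if registers were the only constraint.
  const int by_registers = regs_per_block / regs_per_thread;
  // Also respect the hardware limit on threads per block.
  const int limit = std::min(by_registers, max_threads_per_block);
  // Round down to a whole number of warps.
  return (limit / kWarpSize) * kWarpSize;
}
```

A heuristic that ignores the register term would always pick the thread limit, which can make the launch fail for kernels that use many registers per thread.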

ndellingwood added a commit to ndellingwood/kokkos that referenced this issue Feb 1, 2018
…e-1206

* 'issue-1206' of github.com:ndellingwood/kokkos:
  Issue kokkos#1206 - fix order of args to DynamicView in test_sort
  Issue kokkos#1206: Fix DynamicView API in test_sort in algorithms
  DynamicView: Address issue kokkos#1206
  Attempt to get rid of warning
  Fix issue in deep_copy changes
  Fix an issue with the benchmark suite after changes in macros
  Fix warning with CUDA for OpenMP nthreads unused variable
  Fix issue kokkos#1269
  Fix deep_copy between empty views issue kokkos#1369
  Adding OpenMP InterOp test issue kokkos#1305
  Fix CUDA interoperability and add unit test
  Fix issue kokkos#1363 : Deepcopy between rank-1 views with LayoutLeft/Right
  Adding ChunkSize constructor overload to RangePolicy.
  Error out when -arch not detected