
wrong results for a parallel_reduce with CUDA8 / Maxwell50 #352

Closed
pkestene opened this issue Jul 7, 2016 · 5 comments

@pkestene
Contributor

pkestene commented Jul 7, 2016

I have two systems:

  • both with the same software stack: Ubuntu 16.04 + cuda/8.0
  • one with an old GPU (sm_30)
  • the other with a more recent one (sm_50)

On the old GPU, the cuda.reduce unit test passes, as does example/tutorial/02_simple_reduce.

However, on the newer GPU (sm_50 / K2200), the cuda.reduce unit test passes but 02_simple_reduce gives wrong results.
I tried printing from inside the reduce kernel: the printed values are OK, but as soon as the kernel has finished the final reduction result is wrong, as if the result in GPU memory were correct but not transferred back to host memory (?).

I checked and rechecked the CUDA arch flag to make sure I didn't mess up the build flags.

Am I possibly doing something wrong here?
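
For reference, the reduction being run boils down to a pattern like the sketch below. This is only an illustration of what 02_simple_reduce does (the functor name squaresum is the one that shows up in the racecheck output later in this thread); it is not the exact tutorial source.

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    // Functor summing squares of indices; lsum is the per-thread partial result.
    struct squaresum {
      KOKKOS_INLINE_FUNCTION
      void operator()(const int i, int& lsum) const {
        lsum += i * i;  // accumulate i^2
      }
    };

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        const int n = 10;
        int sum = 0;
        // Expected result: 0^2 + 1^2 + ... + 9^2 = 285 on every backend.
        Kokkos::parallel_reduce(n, squaresum(), sum);
        printf("Sum of squares of the first %d integers: %d (expected 285)\n", n, sum);
      }
      Kokkos::finalize();
      return 0;
    }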

@pkestene
Contributor Author

pkestene commented Jul 8, 2016

On the platform with the Maxwell50 GPU, I used cuda-memcheck with the racecheck tool and got issues coming from the macro BLOCK_REDUCE_STEP (intra-warp reduction):

(These issues are not present when running the simple reduction code on Kepler30 hardware.)

========= CUDA-MEMCHECK
========= Race reported between Write access at 0x000005d8 in /home/pkestene/local/kokkos_cuda_dev/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:264:ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceI9squaresumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES5_EEEEvT
========= and Read access at 0x000005f0 in /home/pkestene/local/kokkos_cuda_dev/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:265:ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceI9squaresumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES5_EEEEvT [64 hazards]

Just for checking, I also rebuilt Kokkos with arch Kepler30 and ran it on the actual Maxwell50 hardware; the problem is still there.
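
To illustrate what this kind of hazard means, here is a generic block reduction written in the classic "warp-synchronous" style; it is NOT the Kokkos BLOCK_REDUCE_STEP code, just a sketch of the pattern, and it assumes a block size of 256. The final strides omit explicit barriers and rely on warp lockstep, and that unsynchronized shared-memory write followed by a read is exactly the kind of pair racecheck reports (the tool is invoked as cuda-memcheck --tool racecheck <executable>).

    // Illustrative only: a generic shared-memory tree reduction, not Kokkos code.
    // Assumes blockDim.x == 256.
    __global__ void block_sum(const int* in, int* out, int n) {
      volatile __shared__ int buf[256];
      const int tid = threadIdx.x;
      const int i   = blockIdx.x * blockDim.x + tid;
      buf[tid] = (i < n) ? in[i] : 0;
      __syncthreads();

      // Tree reduction with explicit barriers down to the warp level.
      for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
      }

      // Warp-synchronous tail: no barrier between each write and the next read.
      // racecheck flags these shared-memory accesses as write/read hazards.
      if (tid < 32) {
        buf[tid] += buf[tid + 32];
        buf[tid] += buf[tid + 16];
        buf[tid] += buf[tid + 8];
        buf[tid] += buf[tid + 4];
        buf[tid] += buf[tid + 2];
        buf[tid] += buf[tid + 1];
      }
      if (tid == 0) out[blockIdx.x] = buf[0];
    }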

@ndellingwood
Contributor

This is related to issue #196. I delayed looking into this due to other higher-priority issues, but will get back to it soon.

@hcedwar hcedwar added the Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) label Jul 20, 2016
@hcedwar hcedwar added this to the Backlog milestone Jul 20, 2016
crtrott added a commit that referenced this issue Aug 29, 2016
This is a workaround which hopefully addresses the reduction issues we have seen and reported in issues #352, #398 and #196.
@crtrott
Member

crtrott commented Aug 29, 2016

OK, I think I might have identified the issue (which might be a bug in CUDA); see issue #398.
If you could try the latest development branch and see if it works now, that would be awesome. You have to compile for Maxwell (i.e. CC 5.0 or higher) to get the workaround.
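
For anyone reproducing the check: one way to target Maxwell with the embedded GNU Make workflow Kokkos shipped around that time is sketched below. The variable names are an assumption based on that era's build system, so adjust them to your checkout; the essential point is simply that nvcc must generate code for sm_50 or newer.

    # Assumed 2016-era Kokkos GNU Make workflow (adjust names/paths as needed):
    make -j KOKKOS_DEVICES=Cuda KOKKOS_ARCH=Maxwell50
    # i.e. nvcc ends up targeting -arch=sm_50 (or a newer architecture)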

@pkestene
Contributor Author

Hi Christian,

Thank you for looking into this.
I confirm that the reduce results are now OK on my Maxwell 5.0 platform.

@crtrott
Member

crtrott commented Aug 29, 2016

Great. I am going to mark this issue as resolved, but I will keep the related Pascal issue open in order to track a real fix that doesn't hurt performance (that said, the current fix only hurts performance for large scalar values, i.e. > 64 bits; below that it should be a wash).
