
wrong results for a parallel_reduce with CUDA8 / Maxwell50 #352

Closed
pkestene opened this issue Jul 7, 2016 · 5 comments

@pkestene
Contributor

pkestene commented Jul 7, 2016

I have two systems:

  • both with the same software stack: Ubuntu 16.04 + cuda/8.0
  • one with an old GPU (sm_30)
  • the other with a more recent one (sm_50)

On the old GPU, the cuda.reduce unit test passes, as does example/tutorial/02_simple_reduce.

However, on the newer GPU (sm_50 / K2200), the cuda.reduce unit test passes but 02_simple_reduce gives wrong results.
I tried printing from inside the reduce kernel: the printed values are OK, but as soon as the kernel has finished the final reduction result is wrong, as if the result in GPU memory were correct but not transferred back to host memory (?).

I checked and rechecked the CUDA arch flag to make sure I didn't mess up the build flags.

Am I possibly doing something wrong here?
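
For reference, the reduction being run boils down to a pattern like the sketch below. This is only an illustration of what 02_simple_reduce does (the functor name squaresum is the one that shows up in the racecheck output later in this thread); it is not the exact tutorial source.

    #include <Kokkos_Core.hpp>
    #include <cstdio>

    // Functor summing squares of indices; lsum is the per-thread partial result.
    struct squaresum {
      KOKKOS_INLINE_FUNCTION
      void operator()(const int i, int& lsum) const {
        lsum += i * i;  // accumulate i^2
      }
    };

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        const int n = 10;
        int sum = 0;
        // Expected result: 0^2 + 1^2 + ... + 9^2 = 285 on every backend.
        Kokkos::parallel_reduce(n, squaresum(), sum);
        printf("Sum of squares of the first %d integers: %d (expected 285)\n", n, sum);
      }
      Kokkos::finalize();
      return 0;
    }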

@pkestene
Contributor Author

pkestene commented Jul 8, 2016

On the platform with the Maxwell50 GPU, I used cuda-memcheck with the racecheck tool and got issues coming from the macro BLOCK_REDUCE_STEP (intra-warp reduction):

(These issues are not present when running the simple reduction code on Kepler30 hardware.)

========= CUDA-MEMCHECK
========= Race reported between Write access at 0x000005d8 in /home/pkestene/local/kokkos_cuda_dev/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:264:ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceI9squaresumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES5_EEEEvT
========= and Read access at 0x000005f0 in /home/pkestene/local/kokkos_cuda_dev/include/Cuda/Kokkos_Cuda_ReduceScan.hpp:265:ZN6Kokkos4Impl33cuda_parallel_launch_local_memoryINS0_14ParallelReduceI9squaresumNS_11RangePolicyIJNS_4CudaEEEENS_11InvalidTypeES5_EEEEvT [64 hazards]

Just for checking, I also rebuilt Kokkos with arch Kepler30 and ran it on the actual Maxwell50 hardware; the problem is still there.
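
To illustrate what this kind of hazard means, here is a generic block reduction written in the classic "warp-synchronous" style; it is NOT the Kokkos BLOCK_REDUCE_STEP code, just a sketch of the pattern, and it assumes a block size of 256. The final strides omit explicit barriers and rely on warp lockstep, and that unsynchronized shared-memory write followed by a read is exactly the kind of pair racecheck reports (the tool is invoked as cuda-memcheck --tool racecheck <executable>).

    // Illustrative only: a generic shared-memory tree reduction, not Kokkos code.
    // Assumes blockDim.x == 256.
    __global__ void block_sum(const int* in, int* out, int n) {
      volatile __shared__ int buf[256];
      const int tid = threadIdx.x;
      const int i   = blockIdx.x * blockDim.x + tid;
      buf[tid] = (i < n) ? in[i] : 0;
      __syncthreads();

      // Tree reduction with explicit barriers down to the warp level.
      for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
      }

      // Warp-synchronous tail: no barrier between each write and the next read.
      // racecheck flags these shared-memory accesses as write/read hazards.
      if (tid < 32) {
        buf[tid] += buf[tid + 32];
        buf[tid] += buf[tid + 16];
        buf[tid] += buf[tid + 8];
        buf[tid] += buf[tid + 4];
        buf[tid] += buf[tid + 2];
        buf[tid] += buf[tid + 1];
      }
      if (tid == 0) out[blockIdx.x] = buf[0];
    }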

@ndellingwood
Contributor

This is related to issue #196. I delayed looking into this due to other higher-priority issues, but will get back to it soon.

@hcedwar hcedwar added the Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) label Jul 20, 2016
@hcedwar hcedwar added this to the Backlog milestone Jul 20, 2016
crtrott added a commit that referenced this issue Aug 29, 2016
This is a workaround which hopefully addresses the reduction issues we have seen and reported in issues #352, #398 and #196.
@crtrott
Member

crtrott commented Aug 29, 2016

OK, I think I might have identified the issue (which might be a bug in CUDA); see issue #398.
If you could try the latest development branch and see if it works now, that would be awesome. You have to compile for Maxwell (i.e. CC 5.0 or higher) to get the workaround.
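
For anyone reproducing the check: one way to target Maxwell with the embedded GNU Make workflow Kokkos shipped around that time is sketched below. The variable names are an assumption based on that era's build system, so adjust them to your checkout; the essential point is simply that nvcc must generate code for sm_50 or newer.

    # Assumed 2016-era Kokkos GNU Make workflow (adjust names/paths as needed):
    make -j KOKKOS_DEVICES=Cuda KOKKOS_ARCH=Maxwell50
    # i.e. nvcc ends up targeting -arch=sm_50 (or a newer architecture)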

@pkestene
Contributor Author

Hi Christian,

Thank you for looking into this.
I confirm that the reduce results are now OK on my Maxwell 5.0 platform.

@crtrott
Member

crtrott commented Aug 29, 2016

Great. I am going to mark this issue as resolved, but I will keep the related Pascal issue open in order to track a real fix that doesn't hurt performance (that said, the current fix only hurts performance for large scalar values, i.e. > 64 bits; below that it should be a wash).
