Reduction on GPU (numba.cuda) vs CPU yields different results #2480

Closed · MikeQUS opened this issue Jul 27, 2017 · 12 comments

MikeQUS commented Jul 27, 2017

Why am I seeing a different result on the GPU compared to the sequential CPU reduce when using this reduction lambda?

import numpy
from numba import cuda
from functools import reduce

A = (numpy.arange(10, dtype=numpy.float64)) + 1
cuda.reduce(lambda a, b: a + b * 20)(A) 
# result 71381.0
reduce(lambda a, b: a + b * 20, A) 
# result 1081.0
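
For reference, the sequential fold applies the lambda left to right, so it unrolls to A[0] + 20 * (A[1] + ... + A[9]); a quick check:

# acc = acc + x * 20 for each element after the first
A[0] + 20 * A[1:].sum()
# result 1081.0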

seibert (Contributor) commented Jul 27, 2017

I get 1081.0 from both CPU and GPU when I run this on my test system.

Can you show the output from numba -s?

MikeQUS (Author) commented Jul 27, 2017

__CUDA Information__
Found 1 CUDA devices
id 0     GeForce [hidden]                               [SUPPORTED]
                      compute capability: 6.1
                           pci device id: 0
                              pci bus id: 1
Summary:
	1/1 devices are supported
CUDA driver version   : 9000
CUDA libraries:
Finding cublas
	ERROR: can't locate lib
Finding cusparse
	ERROR: can't locate lib
Finding cufft
	ERROR: can't locate lib
Finding curand
	ERROR: can't locate lib
Finding nvvm
	ERROR: can't locate lib
	finding libdevice for compute_20...	ERROR: can't open libdevice for compute_20
	finding libdevice for compute_30...	ERROR: can't open libdevice for compute_30
	finding libdevice for compute_35...	ERROR: can't open libdevice for compute_35
	finding libdevice for compute_50...	ERROR: can't open libdevice for compute_50

seibert (Contributor) commented Jul 27, 2017

OK, interesting to see that you are using the CUDA 9 pre-release and that libNVVM is not being detected. I would have assumed Numba would raise an exception rather than try to run. I'll see if we can try out CUDA 9 on a test machine and figure out whether there is some problem we're not aware of yet.

seibert (Contributor) commented Jul 27, 2017

As a side note: cuda.reduce was rewritten for Numba 0.34 to improve performance. I'm curious if it fails the same way for you.
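
For reference, cuda.reduce can also be used as a decorator; a minimal sketch with a plain sum, which is safe here because addition is (mathematically) associative and commutative:

import numpy
from numba import cuda

@cuda.reduce
def sum_reduce(a, b):
    return a + b

A = numpy.arange(100, dtype=numpy.float64) + 1
sum_reduce(A)
# result 5050.0, matches A.sum()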

MikeQUS (Author) commented Jul 27, 2017

Well, I tried it on a different machine with CUDA 7.5 and got the same wrong answer as in the question.
Here are the specs of the other machine:


__CUDA Information__
Found 1 CUDA devices
id 0      b'GeForce GT [hidden]'                              [SUPPORTED]
                      compute capability: 3.0
                           pci device id: 0
                              pci bus id: 1
Summary:
	1/1 devices are supported
CUDA driver version   : 7050
CUDA libraries:
Finding cublas
	named  libcublas.7.5.dylib
	trying to open library...	ok
Finding cusparse
	named  libcusparse.7.5.dylib
	trying to open library...	ok
Finding cufft
	named  libcufft.7.5.dylib
	trying to open library...	ok
Finding curand
	named  libcurand.7.5.dylib
	trying to open library...	ok
Finding nvvm
	named  libnvvm.3.0.0.dylib
	trying to open library...	ok
	finding libdevice for compute_20...	ok
	finding libdevice for compute_30...	ok
	finding libdevice for compute_35...	ok
	finding libdevice for compute_50...	ok


seibert (Contributor) commented Jul 27, 2017

OK, I can confirm your bug when I run with Numba 0.33.0. This was fixed in Numba 0.34.0.

MikeQUS (Author) commented Jul 27, 2017

Interestingly, I tried Java's stream functionality to leverage a parallel reduce and saw the same behavior as with CUDA: the parallel result differs from the sequential reduce.

int n = 10;
float[] inputArray = new float[n];
ArrayList<Float> inputList = new ArrayList<Float>();
for (int i = 0; i < n; i++)
{
    inputArray[i] = i + 1;
    inputList.add(inputArray[i]);
}
// parallel reduce combines partial results in a tree
Optional<Float> resultStream = inputList.stream().parallel().reduce((x, y) -> x + y * 20);
// sequential left-to-right fold over the same data
float resultCPU = inputArray[0];
for (int i = 1; i < inputArray.length; i++)
{
    resultCPU = resultCPU + inputArray[i] * 20;
}
System.out.println("CPU " + resultCPU);             // CPU 1081.0
System.out.println("Stream " + resultStream.get()); // Stream 1.2466232E8

MikeQUS (Author) commented Jul 27, 2017

I installed Numba 0.34.0 and you are right, the results are correct for the example I gave in the question.
However, when you make the array larger, the problem reappears:

import numpy
from numba import cuda
from functools import reduce

A = (numpy.arange(100, dtype=numpy.float64)) + 1
cuda.reduce(lambda a, b: a + b * 20)(A) 
# result 12952749821.0
reduce(lambda a, b: a + b * 20, A) 
# result 100981.0

import numba
numba.__version__
# '0.34.0+5.g1762237'

seibert (Contributor) commented Jul 27, 2017

@sklam: Can you take a look at this?

seibert added this to the Numba 0.35 RC milestone Jul 27, 2017

MikeQUS (Author) commented Jul 27, 2017

@seibert Thanks, by the way, for your fast response and help.
I would like to remove my system specs and just leave the parts you were interested in.

seibert (Contributor) commented Jul 27, 2017

@sklam just pointed out to me that parallel reductions require a commutative (and associative) reduction function. Your lambda function is neither:

lambda a, b: a + b * 20

This also explains why you see the same strange behavior in Java, which is probably doing a tree-based reduction similar to the GPU's. The rewrite of cuda.reduce between 0.33 and 0.34 was a red herring: it just changed the order of operations and accidentally made the size-10 case work, as you noted.

So, basically, parallel reduction can't work for this kernel on any system.
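
To make this concrete, here is a small sketch of a pairwise (tree-shaped) reduction in plain Python. tree_reduce is a hypothetical helper for illustration, not Numba's actual algorithm; the GPU uses a different tree shape, which is why its 71381.0 differs from the value below as well:

from functools import reduce

def tree_reduce(f, xs):
    # Combine elements pairwise, level by level, the way a parallel
    # reduction combines partial results.
    xs = list(xs)
    while len(xs) > 1:
        nxt = [f(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:  # an odd element carries over unchanged
            nxt.append(xs[-1])
        xs = nxt
    return xs[0]

f = lambda a, b: a + b * 20
A = list(range(1, 11))

reduce(f, A)
# result 1081 (left-to-right fold)
tree_reduce(f, A)
# result 75181 (the * 20 compounds differently in the tree)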

MikeQUS (Author) commented Jul 27, 2017

Makes sense. Thank you so much, both of you, @seibert @sklam, for clarifying this for me.
It would be very interesting to find out how to identify non-commutative reduction functions in the compiler and throw an exception in those cases.
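
Proving commutativity statically in a compiler is hard in general, but a cheap runtime spot check can at least catch counterexamples; a minimal sketch (check_reduce_op is a hypothetical helper, not a Numba API):

import random

def check_reduce_op(f, trials=100, seed=0):
    # Probabilistic spot check on integer samples: test commutativity
    # f(a, b) == f(b, a) and associativity f(f(a, b), c) == f(a, f(b, c)).
    # Passing proves nothing, but one failure is a definite counterexample.
    # (Float inputs would need a tolerance, since even a plain sum
    # reassociates with rounding differences.)
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c = (rng.randint(-100, 100) for _ in range(3))
        if f(a, b) != f(b, a):
            raise ValueError("not commutative on (%d, %d)" % (a, b))
        if f(f(a, b), c) != f(a, f(b, c)):
            raise ValueError("not associative on (%d, %d, %d)" % (a, b, c))

check_reduce_op(lambda a, b: a + b)
# passes silently
check_reduce_op(lambda a, b: a + b * 20)
# raises ValueError: not commutative on ...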

MikeQUS closed this as completed Jul 27, 2017