Reduction on GPU (numba.cuda) vs CPU yields different results #2480
Comments
I get 1081.0 from both CPU and GPU when I run this on my test system. Can you show the output from …
OK, interesting to see you are using the CUDA 9 pre-release and that libNVVM is not being detected. I would have assumed Numba would raise an exception rather than try to run. I'll see if we can try out CUDA 9 on a test machine and figure out if there is some problem we're not aware of yet.
As a side note: I tried it on a different machine with CUDA 7 and I got the wrong answer, the same as the one in the question.
OK, I can confirm your bug when I run with Numba 0.33.0. This was fixed in Numba 0.34.0.
Interestingly, I tried the same reduction in Java:

```java
import java.util.ArrayList;
import java.util.Optional;

int n = 10;
float[] inputArray = new float[n];
ArrayList<Float> inputList = new ArrayList<Float>();
for (int i = 0; i < n; i++)
{
    inputArray[i] = i + 1;
    inputList.add(inputArray[i]);
}
// Parallel stream reduction
Optional<Float> resultStream = inputList.stream().parallel().reduce((x, y) -> x + y * 20);
// Sequential left fold over the same values
float resultCPU = inputArray[0];
for (int i = 1; i < inputArray.length; i++)
{
    resultCPU = resultCPU + inputArray[i] * 20;
}
System.out.println("CPU " + resultCPU);             // CPU 10541.0
System.out.println("Stream " + resultStream.get()); // Stream 1.2466232E8
```
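The exact split a Java parallel stream uses is up to the runtime, but a simple pairwise-tree sketch in Python (my own simulation, not Numba's or Java's actual algorithm) already shows how reordering the applications of `a + b * 20` changes the answer:

```python
from functools import reduce

def f(a, b):
    return a + b * 20  # the reduction lambda from the issue

def tree_reduce(f, xs):
    """Pairwise (tree-shaped) reduction, the general shape a GPU or
    parallel stream uses. Odd leftover elements are carried up unchanged."""
    xs = list(xs)
    while len(xs) > 1:
        paired = [f(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:
            paired.append(xs[-1])
        xs = paired
    return xs[0]

values = [float(i) for i in range(1, 11)]  # 1.0 .. 10.0, as in the Java example
sequential = reduce(f, values)             # left fold: 1081.0
parallel = tree_reduce(f, values)          # tree order: a different number
print(sequential, parallel)
```

Because `f` is neither commutative nor associative, the tree order and the left-fold order evaluate genuinely different expressions, so the two results cannot be expected to agree.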
I installed Numba 0.34.0 and you are right, the results are correct for the example I gave in the question.

```python
import numpy
from numba import cuda
from functools import reduce

A = (numpy.arange(100, dtype=numpy.float64)) + 1

cuda.reduce(lambda a, b: a + b * 20)(A)
# result 12952749821.0

reduce(lambda a, b: a + b * 20, A)
# result 100981.0

import numba
numba.__version__
# '0.34.0+5.g1762237'
```
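For what it's worth, the CPU-side number above can be checked without a GPU: the left fold of `a + b * 20` works out to the first element plus 20 times the sum of the rest.

```python
from functools import reduce

A = [float(i) for i in range(1, 101)]  # same values as numpy.arange(100) + 1
cpu = reduce(lambda a, b: a + b * 20, A)
# Left fold: 1 + 20 * (2 + 3 + ... + 100) = 1 + 20 * 5049 = 100981
print(cpu)  # 100981.0
```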
@sklam: Can you take a look at this? |
@seibert Thanks by the way for your fast response and help |
@sklam just pointed out to me that parallel reductions require a commutative and associative reduction function. Your lambda function is neither.
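A quick check of both properties, with arbitrarily chosen values:

```python
f = lambda a, b: a + b * 20

# Not commutative: swapping the arguments changes the result.
print(f(1, 2), f(2, 1))              # 41 vs 22

# Not associative: regrouping changes the result.
print(f(f(1, 2), 3), f(1, f(2, 3)))  # 101 vs 1241
```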
This also explains why you see the same strange behavior in Java, which is probably doing a tree-based reduction similar to the GPU. So, basically, for this kernel, parallel reduction can't work on any system.
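One workaround, not discussed in the thread: since the left fold equals `A[0] + 20 * sum(A[1:])`, and plain summation *is* commutative and associative, the same value can be computed from a parallel-safe sum instead of a parallel fold of the original lambda.

```python
# Recover the sequential-fold result from a plain sum,
# which any parallel reducer handles correctly.
A = [float(i) for i in range(1, 101)]
safe = A[0] + 20 * sum(A[1:])
print(safe)  # 100981.0, matching the sequential functools.reduce result
```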
Why am I seeing a different result on the GPU compared to the sequential CPU when using this reduction lambda?