Reduction on GPU (numba.cuda) vs CPU yields different results #2480

Closed · MikeQUS opened this issue Jul 27, 2017 · 12 comments

MikeQUS commented Jul 27, 2017

Why am I seeing a different result on the GPU compared to the sequential CPU reduce when using this reduction lambda?

import numpy
from numba import cuda
from functools import reduce

A = (numpy.arange(10, dtype=numpy.float64)) + 1
cuda.reduce(lambda a, b: a + b * 20)(A) 
# result 71381.0
reduce(lambda a, b: a + b * 20, A) 
# result 1081.0
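
For reference, the sequential fold applies the lambda left to right, so it unrolls to A[0] + 20 * (A[1] + ... + A[9]); a quick check:

# acc = acc + x * 20 for each element after the first
A[0] + 20 * A[1:].sum()
# result 1081.0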

seibert (Contributor) commented Jul 27, 2017

I get 1081.0 from both CPU and GPU when I run this on my test system.

Can you show the output from numba -s?

MikeQUS (Author) commented Jul 27, 2017

__CUDA Information__
Found 1 CUDA devices
id 0     GeForce [hidden]                               [SUPPORTED]
                      compute capability: 6.1
                           pci device id: 0
                              pci bus id: 1
Summary:
	1/1 devices are supported
CUDA driver version   : 9000
CUDA libraries:
Finding cublas
	ERROR: can't locate lib
Finding cusparse
	ERROR: can't locate lib
Finding cufft
	ERROR: can't locate lib
Finding curand
	ERROR: can't locate lib
Finding nvvm
	ERROR: can't locate lib
	finding libdevice for compute_20...	ERROR: can't open libdevice for compute_20
	finding libdevice for compute_30...	ERROR: can't open libdevice for compute_30
	finding libdevice for compute_35...	ERROR: can't open libdevice for compute_35
	finding libdevice for compute_50...	ERROR: can't open libdevice for compute_50

seibert (Contributor) commented Jul 27, 2017

OK, interesting to see that you are using the CUDA 9 pre-release and that libNVVM is not being detected. I would have assumed Numba would raise an exception rather than try to run. I'll see if we can try out CUDA 9 on a test machine and figure out whether there is some problem we're not aware of yet.

seibert (Contributor) commented Jul 27, 2017

As a side note: cuda.reduce was rewritten for Numba 0.34 to improve performance. I'm curious if it fails the same way for you.
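
For reference, cuda.reduce can also be used as a decorator; a minimal sketch with a plain sum, which is safe here because addition is (mathematically) associative and commutative:

import numpy
from numba import cuda

@cuda.reduce
def sum_reduce(a, b):
    return a + b

A = numpy.arange(100, dtype=numpy.float64) + 1
sum_reduce(A)
# result 5050.0, matches A.sum()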

MikeQUS (Author) commented Jul 27, 2017

Well, I tried it on a different machine with CUDA 7.5 and got the same wrong answer as in the question.
Here are the specs of the other machine:


__CUDA Information__
Found 1 CUDA devices
id 0      b'GeForce GT [hidden]'                              [SUPPORTED]
                      compute capability: 3.0
                           pci device id: 0
                              pci bus id: 1
Summary:
	1/1 devices are supported
CUDA driver version   : 7050
CUDA libraries:
Finding cublas
	named  libcublas.7.5.dylib
	trying to open library...	ok
Finding cusparse
	named  libcusparse.7.5.dylib
	trying to open library...	ok
Finding cufft
	named  libcufft.7.5.dylib
	trying to open library...	ok
Finding curand
	named  libcurand.7.5.dylib
	trying to open library...	ok
Finding nvvm
	named  libnvvm.3.0.0.dylib
	trying to open library...	ok
	finding libdevice for compute_20...	ok
	finding libdevice for compute_30...	ok
	finding libdevice for compute_35...	ok
	finding libdevice for compute_50...	ok


seibert (Contributor) commented Jul 27, 2017

OK, I can confirm your bug when I run with Numba 0.33.0. This was fixed in Numba 0.34.0.

MikeQUS (Author) commented Jul 27, 2017

Interestingly, I tried Java's stream functionality to leverage a parallel reduce and saw the same behavior as with CUDA: the parallel result differs from the sequential reduce.

int n = 10;
float[] inputArray = new float[n];
ArrayList<Float> inputList = new ArrayList<Float>();
for (int i = 0; i < n; i++)
{
    inputArray[i] = i + 1;
    inputList.add(inputArray[i]);
}
// parallel reduce combines partial results in a tree
Optional<Float> resultStream = inputList.stream().parallel().reduce((x, y) -> x + y * 20);
// sequential left-to-right fold over the same data
float resultCPU = inputArray[0];
for (int i = 1; i < inputArray.length; i++)
{
    resultCPU = resultCPU + inputArray[i] * 20;
}
System.out.println("CPU " + resultCPU);             // CPU 1081.0
System.out.println("Stream " + resultStream.get()); // Stream 1.2466232E8

MikeQUS (Author) commented Jul 27, 2017

I installed Numba 0.34.0 and you are right, the results are correct for the example I gave in the question.
However, when you make the array larger, the problem reappears:

import numpy
from numba import cuda
from functools import reduce

A = (numpy.arange(100, dtype=numpy.float64)) + 1
cuda.reduce(lambda a, b: a + b * 20)(A) 
# result 12952749821.0
reduce(lambda a, b: a + b * 20, A) 
# result 100981.0

import numba
numba.__version__
# '0.34.0+5.g1762237'

seibert (Contributor) commented Jul 27, 2017

@sklam: Can you take a look at this?

seibert added this to the Numba 0.35 RC milestone Jul 27, 2017

MikeQUS (Author) commented Jul 27, 2017

@seibert Thanks, by the way, for your fast response and help.
I would like to remove my system specs and just leave the parts you were interested in.

seibert (Contributor) commented Jul 27, 2017

@sklam just pointed out to me that parallel reductions require a commutative (and associative) reduction function. Your lambda function is neither:

lambda a, b: a + b * 20

This also explains why you see the same strange behavior in Java, which is probably doing a tree-based reduction similar to the GPU's. The rewrite of cuda.reduce between 0.33 and 0.34 was a red herring: it just changed the order of operations and accidentally made the size-10 case work, as you noted.

So, basically, parallel reduction can't work for this kernel on any system.
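
To make this concrete, here is a small sketch of a pairwise (tree-shaped) reduction in plain Python. tree_reduce is a hypothetical helper for illustration, not Numba's actual algorithm; the GPU uses a different tree shape, which is why its 71381.0 differs from the value below as well:

from functools import reduce

def tree_reduce(f, xs):
    # Combine elements pairwise, level by level, the way a parallel
    # reduction combines partial results.
    xs = list(xs)
    while len(xs) > 1:
        nxt = [f(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:  # an odd element carries over unchanged
            nxt.append(xs[-1])
        xs = nxt
    return xs[0]

f = lambda a, b: a + b * 20
A = list(range(1, 11))

reduce(f, A)
# result 1081 (left-to-right fold)
tree_reduce(f, A)
# result 75181 (the * 20 compounds differently in the tree)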

MikeQUS (Author) commented Jul 27, 2017

Makes sense. Thank you so much, both of you, @seibert @sklam, for clarifying this for me.
It would be very interesting to find out how to identify non-commutative reduction functions in the compiler and throw an exception in those cases.
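
Proving commutativity statically in a compiler is hard in general, but a cheap runtime spot check can at least catch counterexamples; a minimal sketch (check_reduce_op is a hypothetical helper, not a Numba API):

import random

def check_reduce_op(f, trials=100, seed=0):
    # Probabilistic spot check on integer samples: test commutativity
    # f(a, b) == f(b, a) and associativity f(f(a, b), c) == f(a, f(b, c)).
    # Passing proves nothing, but one failure is a definite counterexample.
    # (Float inputs would need a tolerance, since even a plain sum
    # reassociates with rounding differences.)
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c = (rng.randint(-100, 100) for _ in range(3))
        if f(a, b) != f(b, a):
            raise ValueError("not commutative on (%d, %d)" % (a, b))
        if f(f(a, b), c) != f(a, f(b, c)):
            raise ValueError("not associative on (%d, %d, %d)" % (a, b, c))

check_reduce_op(lambda a, b: a + b)
# passes silently
check_reduce_op(lambda a, b: a + b * 20)
# raises ValueError: not commutative on ...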

MikeQUS closed this as completed Jul 27, 2017