
Numba CUDA vectorize and reduce decorators slower than expected #2266

Closed
david-hoffman opened this issue Feb 3, 2017 · 2 comments
@david-hoffman commented Feb 3, 2017

I posted this originally on Stack Overflow, but I thought this might be a better forum.

I've been testing some basic CUDA functions using the Numba package. My main goal is to implement a Richardson-Lucy deconvolution algorithm on the GPU. The algorithm can be accelerated, and one of the main steps in doing so can be summarized in the following dummy function:

    def dummy(arr1, arr2):
        return (arr1 * arr2).sum() / ((arr2**2).sum() + eps)

This function runs reasonably fast on the CPU, but I'd like to keep everything on the GPU to avoid host <---> device copies.
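For reference, here is a self-contained CPU version of the step above that can actually be run (the `eps` value is hypothetical, since `eps` is defined elsewhere in my code):

```python
import numpy as np

eps = 1e-6  # hypothetical regularization constant; defined elsewhere in the real code

def dummy(arr1, arr2):
    # One elementwise multiply, one elementwise square, two full
    # reductions, and a scalar divide: each piece maps naturally
    # onto a GPU kernel, which is why I want to keep it on-device.
    return (arr1 * arr2).sum() / ((arr2 ** 2).sum() + eps)

a = np.ones(4, dtype=np.float32)
b = np.full(4, 2.0, dtype=np.float32)
print(dummy(a, b))  # ≈ 0.5 for this input: (1*2)*4 / (4*4 + eps)
```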

To compare the speeds of the different calculations I wrote a short set of functions:

    import numpy as np
    from numba import njit, jit
    import numba
    import numba.cuda as cuda
    import timeit
    import time


    # define our functions
    @numba.vectorize(["float32(float32, float32)", "float64(float64, float64)"], target="cuda")
    def add_gpu(a, b):
        return a + b

    @numba.vectorize(["float32(float32, float32)", "float64(float64, float64)"], target="cuda")
    def mult_gpu(a, b):
        return a * b

    @cuda.reduce
    def sum_gpu(a, b):
        return a + b

    @cuda.jit
    def add_gpu_1d(a, b, c):
        x = cuda.grid(1)
        if x < c.size:
            c[x] = a[x] + b[x]

    @cuda.jit
    def mult_gpu_1d(a, b, c):
        x = cuda.grid(1)
        if x < c.size:
            c[x] = a[x] * b[x]

    @cuda.jit
    def mult_gpu_2d(a, b, c):
        x, y = cuda.grid(2)
        if x < c.shape[0] and y < c.shape[1]:
            c[x, y] = a[x, y] * b[x, y]

    @cuda.jit
    def add_gpu_2d(a, b, c):
        x, y = cuda.grid(2)
        if x < c.shape[0] and y < c.shape[1]:
            c[x, y] = a[x, y] + b[x, y]

and some timer functions:

    def avg_t(t, num):
        return np.mean(t) / num

    def format_t(t):
        """Turn t into nice formating"""
        if t < 1e-3:
            return "{:.1f} us".format(t * 1e6)
        elif t < 1:
            return "{:.1f} ms".format(t * 1e3)
        else:
            return "{:.1f} s".format(t)

    def test_1d_times(data_len, dtype=np.float32):
        num_times = 10

        title = "Testing 1D Data, Data length = {}, data type = {}".format(data_len, dtype)
        print(len(title) * "=")
        print(title)
        print(len(title) * "=")

        t = time.time()
        arr1, arr2 = np.empty((2, data_len), dtype=dtype)
        d_arr1 = cuda.to_device(arr1)
        d_arr2 = cuda.to_device(arr2)
        d_result = cuda.device_array_like(d_arr1)
        print("Data generated in " + format_t(time.time() - t))
        print("d_arr1 dtype =", d_arr1.dtype)
        print("d_arr1 size = ", d_arr1.size)

        print()
        print("Testing multiplication times")
        print("----------------------------")

        t = timeit.repeat((lambda: arr1 * arr2), number=num_times)
        print("cpu/numpy time = " + format_t(avg_t(t, num_times)))

        t = timeit.repeat((lambda: mult_gpu(d_arr1, d_arr2)), number=num_times)
        print("cuda vectorize time = " + format_t(avg_t(t, num_times)))

    t = timeit.repeat((lambda: mult_gpu_1d(d_arr1, d_arr2, d_result)), number=num_times)
        print("cuda_mult_1d time = " + format_t(avg_t(t, num_times)))

        print()
        print("Testing sum times")
        print("------------------")

        t = timeit.repeat((lambda: arr1 + arr2), number=num_times)
        print("cpu/numpy time = " + format_t(avg_t(t, num_times)))

        t = timeit.repeat((lambda: add_gpu(d_arr1, d_arr2)), number=num_times)
        print("cuda vectorize time = " + format_t(avg_t(t, num_times)))

    t = timeit.repeat((lambda: add_gpu_1d(d_arr1, d_arr2, d_result)), number=num_times)
        print("cuda_add_1d time = " + format_t(avg_t(t, num_times)))

        print()
        print("Testing reduction times")
        print("-----------------------")

        t = timeit.repeat((lambda: arr1.sum()), number=num_times)
        print("cpu/numpy time = " + format_t(avg_t(t, num_times)))

        t = timeit.repeat((lambda: add_gpu.reduce(d_arr1)), number=num_times)
        print("cuda vectorize time = " + format_t(avg_t(t, num_times)))

        t = timeit.repeat((lambda: sum_gpu(d_arr1)), number=num_times)
        print("sum_gpu time = " + format_t(avg_t(t, num_times)))
        print()

    def test_2d_times(data_len, dtype=np.float32):
        num_times = 10

        title = "Testing 2D Data, Data length = {}, data type = {}".format(data_len, dtype)
        print(len(title) * "=")
        print(title)
        print(len(title) * "=")

        t = time.time()
        arr1, arr2 = np.empty((2, data_len, data_len), dtype=dtype)
        d_arr1 = cuda.to_device(arr1)
        d_arr2 = cuda.to_device(arr2)
        d_result = cuda.device_array_like(d_arr1)
        print("Data generated in {} seconds".format(time.time() - t))
        print("d_arr1 dtype =", d_arr1.dtype)
        print("d_arr1 size = ", d_arr1.size)

        print()
        print("Testing multiplication times")
        print("----------------------------")

        t = timeit.repeat((lambda: arr1 * arr2), number=num_times)
        print("cpu/numpy time = " + format_t(avg_t(t, num_times)))

        t = timeit.repeat((lambda: mult_gpu(d_arr1, d_arr2)), number=num_times)
        print("cuda vectorize time = " + format_t(avg_t(t, num_times)))

    t = timeit.repeat((lambda: mult_gpu_2d(d_arr1, d_arr2, d_result)), number=num_times)
        print("cuda_mult_2d time = " + format_t(avg_t(t, num_times)))

        print()
        print("Testing sum times")
        print("------------------")

        t = timeit.repeat((lambda: arr1 + arr2), number=num_times)
        print("cpu/numpy time = " + format_t(avg_t(t, num_times)))

        t = timeit.repeat((lambda: add_gpu(d_arr1, d_arr2)), number=num_times)
        print("cuda vectorize time = " + format_t(avg_t(t, num_times)))

    t = timeit.repeat((lambda: add_gpu_2d(d_arr1, d_arr2, d_result)), number=num_times)
        print("cuda_add_2d time = " + format_t(avg_t(t, num_times)))

        print()
        print("Testing reduction times")
        print("-----------------------")

        t = timeit.repeat((lambda: arr1.sum()), number=num_times)
        print("cpu/numpy time = " + format_t(avg_t(t, num_times)))

        t = timeit.repeat((lambda: add_gpu.reduce(d_arr1.ravel())), number=num_times)
        print("cuda vectorize time = " + format_t(avg_t(t, num_times)))

        t = timeit.repeat((lambda: sum_gpu(d_arr1.ravel())), number=num_times)
        print("sum_gpu time = " + format_t(avg_t(t, num_times)))
        print()

Running the test functions:

    numba.cuda.detect()
    test_1d_times(2**24)
    test_2d_times(2**12)
    test_1d_times(2**24, dtype=np.float64)
    test_2d_times(2**12, dtype=np.float64)

gives the following output:

Found 1 CUDA devices
id 0    b'GeForce GTX TITAN X'                              [SUPPORTED]
                      compute capability: 5.2
                           pci device id: 0
                              pci bus id: 3
Summary:
    1/1 devices are supported
============================================================================
Testing 1D Data, Data length = 16777216, data type = <class 'numpy.float32'>
============================================================================
Data generated in 88.2 ms
d_arr1 dtype = float32
d_arr1 size =  16777216

Testing multiplication times
----------------------------
cpu/numpy time = 35.8 ms
cuda vectorize time = 122.8 ms
cuda_mult_1d time = 206.8 us

Testing sum times
------------------
cpu/numpy time = 35.8 ms
cuda vectorize time = 106.1 ms
cuda_add_1d time = 212.6 us

Testing reduction times
-----------------------
cpu/numpy time = 16.7 ms
cuda vectorize time = 11.1 ms
sum_gpu time = 127.3 ms

========================================================================
Testing 2D Data, Data length = 4096, data type = <class 'numpy.float32'>
========================================================================
Data generated in 0.0800013542175293 seconds
d_arr1 dtype = float32
d_arr1 size =  16777216

Testing multiplication times
----------------------------
cpu/numpy time = 35.4 ms
cuda vectorize time = 97.9 ms
cuda_mult_2d time = 208.9 us

Testing sum times
------------------
cpu/numpy time = 36.3 ms
cuda vectorize time = 94.5 ms
cuda_add_2d time = 250.8 us

Testing reduction times
-----------------------
cpu/numpy time = 16.4 ms
cuda vectorize time = 15.8 ms
sum_gpu time = 125.4 ms

============================================================================
Testing 1D Data, Data length = 16777216, data type = <class 'numpy.float64'>
============================================================================
Data generated in 171.0 ms
d_arr1 dtype = float64
d_arr1 size =  16777216

Testing multiplication times
----------------------------
cpu/numpy time = 73.2 ms
cuda vectorize time = 114.9 ms
cuda_mult_1d time = 201.9 us

Testing sum times
------------------
cpu/numpy time = 71.4 ms
cuda vectorize time = 71.0 ms
cuda_add_1d time = 217.2 us

Testing reduction times
-----------------------
cpu/numpy time = 29.0 ms
cuda vectorize time = 12.8 ms
sum_gpu time = 123.5 ms

========================================================================
Testing 2D Data, Data length = 4096, data type = <class 'numpy.float64'>
========================================================================
Data generated in 0.301849365234375 seconds
d_arr1 dtype = float64
d_arr1 size =  16777216

Testing multiplication times
----------------------------
cpu/numpy time = 73.7 ms
cuda vectorize time = 84.2 ms
cuda_mult_2d time = 226.2 us

Testing sum times
------------------
cpu/numpy time = 74.9 ms
cuda vectorize time = 84.3 ms
cuda_add_2d time = 208.7 us

Testing reduction times
-----------------------
cpu/numpy time = 29.9 ms
cuda vectorize time = 14.3 ms
sum_gpu time = 121.2 ms

It seems that the @cuda.vectorize-decorated functions perform slower than both the CPU and the custom-written @cuda.jit functions, while the @cuda.jit functions give the expected orders-of-magnitude speedup and nearly constant-time performance (results not shown).

On the other hand, the @cuda.reduce function runs significantly slower than either the @cuda.vectorize function or the CPU function.

Is there a reason for the poor performance of the @cuda.vectorize and @cuda.reduce functions? Is it possible to write a CUDA reduction kernel using just Numba?
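In case it clarifies the second question: what I have in mind is the classic halving (tree) reduction that a hand-written @cuda.jit kernel would perform per block with shared memory. A pure NumPy sketch of that access pattern (illustration only, not GPU code; `tree_reduce` is my own name for it):

```python
import numpy as np

def tree_reduce(arr):
    """Pairwise (tree) reduction: the same halving pattern a CUDA
    shared-memory reduction kernel performs within each block."""
    data = arr.astype(np.float64).copy()
    n = data.size
    while n > 1:
        half = n // 2
        # Each "thread" i adds element i + half into element i.
        data[:half] += data[half:2 * half]
        # If n is odd, fold the leftover last element into slot 0.
        if n % 2:
            data[0] += data[n - 1]
        n = half
    return data[0]

x = np.arange(1, 9, dtype=np.float32)  # 1..8
print(tree_reduce(x))  # 36.0
```

On the GPU, each of the log2(n) halving steps would be one synchronized stride inside a block, followed by a second kernel pass over the per-block partial sums.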

sklam (Member) commented Feb 3, 2017

Thanks for the bug report.

The reason the CUDA vectorize version is slower is that each call allocates a device array for the output, whereas the cuda.jit version is provided with an output array. You can supply the output array with the out argument, e.g. add_gpu(d_arr1, d_arr2, out=d_result). Note that even with an explicit output array, the vectorize version will be a bit slower than the cuda.jit version.
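The out-array pattern described here is the same one NumPy ufuncs expose on the CPU, so the allocation cost can be illustrated without a GPU (a CPU-side analogy only; on the CUDA target the same keyword additionally avoids a device allocation per call):

```python
import numpy as np

a = np.ones(1_000_000, dtype=np.float32)
b = np.full_like(a, 3.0)
result = np.empty_like(a)  # allocated once, reused across calls

# Without out=, every call allocates a fresh output array.
fresh = np.add(a, b)

# With out=, the ufunc writes into the preallocated buffer instead,
# mirroring add_gpu(d_arr1, d_arr2, out=d_result) on the CUDA target.
np.add(a, b, out=result)

print(result[0])                            # 4.0
print(np.add(a, b, out=result) is result)   # True: no new array created
```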

As for the slowdown with cuda.reduce, I have found that it is not caching the compiled code properly, so it recompiles every time it is called. I will open an issue for that.

seibert (Contributor) commented Oct 31, 2017

Closing as #2268 is tracking the issue.

@seibert seibert closed this as completed Oct 31, 2017