In [None]:
from numba import cuda

print(cuda.is_available())

print(cuda.detact())

print(cuda.gpus.current())

1. GPU programming in Numba using ufuncs

In [9]:
target = 'parallel'

In [10]:
from numba import vectorize

@vectorize(['float64(float64, float64, float64)'], target=target)
def add(x, y):
    return x + y


In [12]:
import numpy as np
x = np.random.uniform(size=100).astype(np.float64)
y = np.random.uniform(size=100).astype(np.float64)


In [13]:

%timeit x + y

14.7 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
%timeit add(x, y)


11 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Well, the GPU seems to be much slower! There several things to consider when using the GPU to accelerate your code:

* Input size: Maybe the input is not large enough to keep the very large number of GPU cores busy?
* Too simple: Maybe the computation is not heavy enough, or does not involve math operations?
* Datatypes not ideal: GPU hardware is usually way less efficient ( 2x to 25x) in running float64 operations than it is in running float32 operations.
* Data copying: Maybe we include the time needed to copy the data to the GPU and back to host memory? Is there a way to avoid it?

In [None]:
@vectorize(['float32(float32, float32, float32)',
            'float64(float64, float64, float64)'], target=target)
def saxpy(x, y, a):
    return a * x + y


In [None]:
x = np.random.uniform(size=1000000).astype(np.float32)
y = np.random.uniform(size=1000000).astype(np.float32)
a = 0.5


In [None]:
%timeit a*x + y


In [None]:
%timeit saxpy(x, y, a)


A significant improvement, even including the CPU-GPU data copying. 

2. CUDA Python kernels