# Awkward Array x NVIDIA
Awkward Array provides an API for running calculations on NVIDIA GPUs when one is available. The interface is the one Awkward users already know, so no CUDA programming is required. Below, I show how simple it is to use a GPU as an accelerator with Awkward Array.

First, let's create the same array on both the CPU and the GPU by specifying the requested `backend`.

In [1]:
import awkward as ak
import numpy as np

# A tiny array: 11 integers
data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Same data, once in host memory and once in GPU memory
array_CPU = ak.Array(data, backend="cpu")
array_GPU = ak.Array(data, backend="cuda")

In [2]:
array_GPU

Let's compare the performance between the backends.

In [3]:
%%timeit
# CPU
array_CPU**2

380 μs ± 2.03 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [4]:
%%timeit
# GPU
array_GPU**2

1.06 ms ± 252 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)


But wait! The GPU is noticeably slower here. GPUs have high throughput, but they also have high latency: the CPU needs time to send instructions to the GPU, and copying data between the host (CPU) and the device (GPU) is expensive. The GPU only pays off when the computation time far exceeds this overhead. Let's compare again with much more data (about 4 GB).

In [5]:
shape = 500_000_000
# integers() returns int64 by default (8 bytes per element), so this is about 4 GB
data = np.random.default_rng().integers(low=0, high=10, size=shape)
array_CPU = ak.Array(data, backend="cpu")
array_GPU = ak.Array(data, backend="cuda")

In [6]:
array_GPU
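
As a rough check on the size quoted above, the footprint can be worked out from the element count and the element width. This is a minimal sketch: it assumes the array holds `int64` values (the default `dtype` of `Generator.integers`), and the names `n_elements`, `input_bytes` and `peak_bytes` are only illustrative.

```python
import numpy as np

n_elements = 500_000_000
# Bytes per element, assuming the default int64 dtype of Generator.integers
itemsize = np.dtype(np.int64).itemsize

# Memory needed for the input array alone
input_bytes = n_elements * itemsize
# array**2 allocates a result of the same size, so input and result
# together need about twice as much (ignoring any other temporaries)
peak_bytes = 2 * input_bytes

print(f"input:          {input_bytes / 1e9:.1f} GB")
print(f"input + result: {peak_bytes / 1e9:.1f} GB")
```

Under these assumptions, the GPU needs room for both the input and the squared result at the same time, which is worth checking against the memory available on the card before scaling the example up.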

In [7]:
%%time
result = array_CPU**2

CPU times: user 505 ms, sys: 952 ms, total: 1.46 s
Wall time: 1.46 s


In [8]:
%%time
result = array_GPU**2

CPU times: user 5.37 ms, sys: 16.9 ms, total: 22.3 ms
Wall time: 28 ms
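
To get a feel for where the GPU starts to pay off, the four timings above can be fed into a very crude model, `t = overhead + per_element * n`, fitted through the small and the large measurement for each backend. This is only a sketch: a two-point straight-line fit is my assumption, not something Awkward Array provides, and it takes the measured wall times at face value.

```python
# Timings copied from the cells above: (number of elements, seconds)
CPU_POINTS = [(11, 380e-6), (500_000_000, 1.46)]
GPU_POINTS = [(11, 1.06e-3), (500_000_000, 28e-3)]


def fit_line(points):
    """Fit t = overhead + per_element * n through two (n, t) points."""
    (n0, t0), (n1, t1) = points
    per_element = (t1 - t0) / (n1 - n0)
    overhead = t0 - per_element * n0
    return overhead, per_element


cpu_overhead, cpu_per_element = fit_line(CPU_POINTS)
gpu_overhead, gpu_per_element = fit_line(GPU_POINTS)

# Array size at which the two straight lines cross
break_even = (gpu_overhead - cpu_overhead) / (cpu_per_element - gpu_per_element)
# Ratio of the two large-array wall times
speedup = CPU_POINTS[1][1] / GPU_POINTS[1][1]

print(f"CPU: {cpu_overhead * 1e6:.0f} us overhead, {cpu_per_element * 1e9:.3f} ns per element")
print(f"GPU: {gpu_overhead * 1e6:.0f} us overhead, {gpu_per_element * 1e9:.3f} ns per element")
print(f"break-even near {break_even:,.0f} elements")
print(f"speedup at 500 million elements: about {speedup:.0f}x")
```

With these numbers the lines cross at a few hundred thousand elements. Real behaviour is not linear (caches, memory bandwidth and kernel launch details all matter), so treat the result as an order-of-magnitude guide only.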


# CuPy

`CuPy` is a library that works like `numpy`, but for arrays stored on CUDA-capable GPUs. Many `numpy` functions are implemented in `cupy`, and `cupy` also exposes some CUDA functionality directly (device synchronization, CUDA streams).

In [9]:
import cupy as cp
import numpy as np
shape = (1000, 1000)
data = np.random.default_rng().integers(low=0, high=100, size=shape)
# cp.array copies the host data into GPU memory
data_cupy = cp.array(data)
# Independent copy that stays in host memory
data_cpu = np.array(data)

In [10]:
print(type(data_cupy))
print(data_cupy.data)

<class 'cupy.ndarray'>
<MemoryPointer 0x7f8a92800000 device=0 mem=<cupy.cuda.memory.PooledMemory object at 0x7f92e9751fb0>>


In [11]:
%%timeit
result = np.matmul(data_cpu, data_cpu)

716 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
%%timeit
result = cp.matmul(data_cupy, data_cupy)

115 μs ± 62.8 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
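
One caveat when reading GPU timings like this: CuPy launches kernels asynchronously, so a timer that stops as soon as the call returns may capture mostly the launch rather than the finished computation. The sketch below times the same call with and without the device synchronization mentioned earlier. The helper `best_time` is hypothetical (written for this example), the matrix is shrunk to 200 x 200 so it runs quickly, and the CuPy part is skipped when no CUDA device is found.

```python
import time

import numpy as np

try:
    import cupy as cp

    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed, or no usable CUDA device
    cp = None
    HAVE_GPU = False


def best_time(func, sync=None, repeats=5):
    """Best wall-clock time of func() in seconds, optionally waiting on the device."""
    func()  # warm-up call, so one-time setup cost is not measured
    if sync is not None:
        sync()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        if sync is not None:
            sync()  # block until the GPU has actually finished
        best = min(best, time.perf_counter() - start)
    return best


rng = np.random.default_rng(0)
data_cpu = rng.integers(low=0, high=100, size=(200, 200))
cpu_seconds = best_time(lambda: np.matmul(data_cpu, data_cpu))
print(f"NumPy:                  {cpu_seconds * 1e3:.3f} ms")

if HAVE_GPU:
    data_gpu = cp.asarray(data_cpu)
    sync = cp.cuda.Device().synchronize
    launch_seconds = best_time(lambda: cp.matmul(data_gpu, data_gpu))
    total_seconds = best_time(lambda: cp.matmul(data_gpu, data_gpu), sync=sync)
    print(f"CuPy, no synchronize:   {launch_seconds * 1e3:.3f} ms")
    print(f"CuPy, with synchronize: {total_seconds * 1e3:.3f} ms")
else:
    print("No CUDA device found; skipping the CuPy timing")
```

If the synchronized time is much larger than the unsynchronized one, the unsynchronized figure is mostly launch overhead. CuPy also ships `cupyx.profiler.benchmark`, which handles warm-up and synchronization for you.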
