# Exercise 2a - CuPy demo
The goal of this demo is to show the basic use case of CuPy. <br>
We will perform a simple elementwise operation on the GPU in two ways: 
- using CuPy's array API
- manually by building and calling a C kernel.

In [None]:
import numpy as np
import cupy as cp

Create an object representing the GPU device

In [None]:
device_gpu = cp.cuda.Device(0)

In [None]:
# free and total GPU memory in bytes, returned as a (free, total) tuple
device_gpu.mem_info
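
The raw byte counts returned by `mem_info` are hard to read at a glance. Below is a minimal sketch of a formatting helper, assuming the `(free, total)` tuple layout described above. The helper name `format_mem_info` and the sample values are hypothetical and not part of CuPy.

```python
def format_mem_info(free_bytes, total_bytes):
    """Format a (free, total) byte pair as a human-readable string in GiB."""
    gib = 1024 ** 3
    used_bytes = total_bytes - free_bytes
    return (f"free: {free_bytes / gib:.2f} GiB, "
            f"used: {used_bytes / gib:.2f} GiB, "
            f"total: {total_bytes / gib:.2f} GiB")


# Hypothetical values for a 16 GiB device with 12 GiB free.
# With a GPU available, this would be: format_mem_info(*device_gpu.mem_info)
example = format_mem_info(12 * 1024 ** 3, 16 * 1024 ** 3)
print(example)
```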

Initialize an input array on the CPU

In [None]:
x_cpu = np.random.rand(int(1e6))

Transfer this array to the GPU

In [None]:
x_gpu = cp.array(x_cpu)

CuPy arrays work like NumPy arrays

In [None]:
print(type(x_cpu), type(x_gpu))

In [None]:
x_gpu.shape

You can find the CuPy equivalent of each NumPy math operation [here](https://docs.cupy.dev/en/stable/reference/comparison.html#numpy-cupy-apis).

An elementwise operation

In [None]:
y_gpu = 2 * cp.sin(x_gpu) + cp.exp(x_gpu)
print(type(y_gpu))

A reduction operation.

In [None]:
z_cpu = np.sum(x_cpu)
z_gpu = cp.sum(x_gpu)
print(z_cpu, type(z_cpu), z_gpu, type(z_gpu))

Transfer data back to the CPU

In [None]:
y_cpu = y_gpu.get()
z_cpu = z_gpu.get()
print(type(y_cpu), type(z_cpu))

We can do the same using a manually defined low-level C kernel

In [None]:
source_str = r"""
extern "C"{
__global__
void elementwise(const double* x, 
                 double* y)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    y[i] = 2 * sin(x[i]) + exp(x[i]);
}
}
"""

Build and load the kernel function

In [None]:
module = cp.RawModule(code=source_str)
elementwise_kernel = module.get_function("elementwise")

Define an output array on the GPU

In [None]:
y_gpu_2 = cp.zeros_like(x_gpu)

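To call the kernel function, we define the thread block size and the number of blocks. The kernel has no bounds check, so the grid must contain exactly as many threads as the array has elements.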

In [None]:
blocksize = 1
n_blocks = int(np.ceil(len(x_cpu) / blocksize))  # grid has len(x) threads grouped into blocks
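
The grid-size calculation can also be written with integer arithmetic only. The sketch below is plain Python and needs no GPU. The helper name `grid_size` and the block size of 256 are illustrative choices, not values from this exercise. It also shows how many surplus threads a launch would contain when the block size does not divide the array length.

```python
def grid_size(n_elements, blocksize):
    """Number of blocks needed to cover n_elements (integer ceiling division)."""
    return (n_elements + blocksize - 1) // blocksize


n = 1_000_000  # array length used in this demo
results = {}
for blocksize in (1, 256):
    n_blocks = grid_size(n, blocksize)
    # Threads whose global index would lie past the end of the array.
    surplus = n_blocks * blocksize - n
    results[blocksize] = (n_blocks, surplus)
    print(f"blocksize={blocksize}: n_blocks={n_blocks}, surplus threads={surplus}")
```

With surplus threads in the grid, a kernel without a bounds check would read and write past the end of the arrays. A common remedy is to pass the array length to the kernel and guard the body with `if (i < n)`.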

In [None]:
elementwise_kernel(grid=(n_blocks,), block=(blocksize,), args=(x_gpu, y_gpu_2))
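
To make the launch parameters more concrete, here is a minimal CPU sketch of what the launch does. It loops over every block and every thread within a block, and applies the kernel's index expression. This is a plain-Python emulation for illustration only; the function name `emulate_elementwise` is hypothetical, and on the GPU the threads run in parallel rather than in a loop.

```python
import math


def emulate_elementwise(x, n_blocks, blocksize):
    """Emulate the kernel's index mapping on the CPU, one 'thread' at a time."""
    y = [0.0] * len(x)
    for block_idx in range(n_blocks):        # plays the role of blockIdx.x
        for thread_idx in range(blocksize):  # plays the role of threadIdx.x
            # Same expression as in the kernel: blockDim.x * blockIdx.x + threadIdx.x
            i = blocksize * block_idx + thread_idx
            y[i] = 2 * math.sin(x[i]) + math.exp(x[i])
    return y


x = [0.0, 0.5, 1.0, 1.5]
y_two_per_block = emulate_elementwise(x, n_blocks=2, blocksize=2)
y_one_per_block = emulate_elementwise(x, n_blocks=4, blocksize=1)
print(y_two_per_block == y_one_per_block)
```

Like the kernel, the emulation has no bounds check. If `n_blocks * blocksize` exceeds `len(x)`, Python raises an `IndexError`, whereas the GPU kernel would access memory out of bounds.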

Check that the two outputs (using the API and the C kernel) are the same.

In [None]:
cp.allclose(y_gpu, y_gpu_2)
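
`allclose` compares the two arrays within a tolerance rather than for exact equality, which is appropriate here because the two code paths may round differently. The sketch below is a pure-Python illustration of the tolerance criterion, assuming the rule documented for NumPy's `allclose` (`|a - b| <= atol + rtol * |b|` with defaults `rtol=1e-5`, `atol=1e-8`). The function name `allclose_sketch` is hypothetical.

```python
import math


def allclose_sketch(a, b, rtol=1e-5, atol=1e-8):
    """Return True if |a_i - b_i| <= atol + rtol * |b_i| for every pair."""
    return all(abs(ai - bi) <= atol + rtol * abs(bi) for ai, bi in zip(a, b))


reference = [2 * math.sin(v) + math.exp(v) for v in (0.0, 0.5, 1.0)]
perturbed = [r * (1 + 1e-9) for r in reference]  # tiny relative error
wrong = [r + 1e-3 for r in reference]            # clearly different values

close_result = allclose_sketch(reference, perturbed)
wrong_result = allclose_sketch(reference, wrong)
print(close_result, wrong_result)
```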

#### Profiling

CuPy has its own benchmark utility for simple timing tests. It handles GPU synchronization for us, so we do not have to insert synchronization barriers manually.

In [None]:
from cupyx.profiler import benchmark

We can use it by first wrapping our CuPy operation in a Python function.

In [None]:
def elementwise(n_blocks, blocksize, x, y):
    elementwise_kernel(grid=(n_blocks,), block=(blocksize,), args=(x, y))

Define block and grid parameters, then time the execution of our elementwise kernel. Do 100 iterations for reasonable timing statistics.

In [None]:
blocksize = 1
n_blocks = int(np.ceil(len(x_cpu) / blocksize))

x_gpu = cp.array(x_cpu)
y_gpu = cp.zeros_like(x_gpu)
print(benchmark(elementwise, (n_blocks, blocksize, x_gpu, y_gpu), n_repeat=100))