# Exercise 2b - PyOpenCL demo
The goal of this exercise is to show the basic use case of PyOpenCL. <br>
We will perform a simple elementwise operation on the GPU in two ways: 
- using PyOpenCL's array API
- manually by building and calling a C kernel.

In [None]:
import numpy as np
import pyopencl as cl

In [None]:
import pyopencl.array  # import the submodule explicitly so that cl.array is available
cl.array

In PyOpenCL we have to define our context manually. In the simplest case this means we select a platform, then a device, and then initialize a context on this device.

In [None]:
import pyopencl as cl
platforms_list = cl.get_platforms()
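
The platform, device, context sequence described above can be sketched as follows. This is a minimal sketch rather than the notebook's own method: the helper name `make_context` and its index parameters are hypothetical, and it assumes at least one OpenCL platform with at least one device is installed.

```python
def make_context(platform_index=0, device_index=0):
    """Select a platform and a device explicitly, then build a context and a queue.

    Hypothetical helper for illustration; the indices simply pick entries
    from the lists that PyOpenCL reports on the current machine.
    """
    # Imported lazily so the helper can be defined without an OpenCL runtime.
    import pyopencl as cl

    platforms = cl.get_platforms()            # all installed OpenCL platforms
    platform = platforms[platform_index]
    devices = platform.get_devices()          # all devices of that platform
    device = devices[device_index]
    ctx = cl.Context(devices=[device])        # context bound to this one device
    queue = cl.CommandQueue(ctx)              # command queue for that context
    return platform, device, ctx, queue
```

A possible use is `platform, device, ctx, queue = make_context()` followed by `print(platform.name, device.name, device.global_mem_size / 1024**3)`, which reports the selected device and its global memory in GiB.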

In [None]:
import cupy as cp  # the following cells use CuPy, which was not imported above

device_gpu = cp.cuda.Device(0)
device_gpu.attributes

In [None]:
# (free, total) device memory in bytes
device_gpu.mem_info

In [None]:
15843721216 / 1024 / 1024 / 1024  # convert a byte count to GiB

Initialize an input array on the CPU

In [None]:
x_cpu = np.random.rand(10)

Transfer this array to the GPU

In [None]:
x_gpu = cp.array(x_cpu)

CuPy arrays work like NumPy arrays

In [None]:
print(type(x_cpu), type(x_gpu))

In [None]:
x_gpu.shape

In [None]:
x_gpu[::5]

You can find the CuPy equivalent of each NumPy math operation [here](https://docs.cupy.dev/en/stable/reference/comparison.html#numpy-cupy-apis).

An elementwise operation

In [None]:
y_gpu = 2 * cp.sin(x_gpu) + cp.exp(x_gpu)
print(type(y_gpu))

A reduction operation. Note that CuPy returns a zero-dimensional array on the GPU, while NumPy returns a floating-point scalar.

In [None]:
z_cpu = np.sum(x_cpu)
z_gpu = cp.sum(x_gpu)
print(z_cpu, type(z_cpu), z_gpu, type(z_gpu))

Transfer data back to the CPU

In [None]:
y_cpu = y_gpu.get()
z_cpu = z_gpu.get()
print(type(y_cpu), type(z_cpu))

We can do the same using a manually defined low-level C kernel (here a CUDA kernel compiled through CuPy).

In [None]:
source_str = r"""
extern "C"{
__global__
void elementwise(const double* x, 
                 double* y)
{
    // global thread index; there is no bounds check, so the launch
    // must create exactly one thread per array element
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    y[i] = 2 * sin(x[i]) + exp(x[i]);
}
}
"""

Build and load the kernel function

In [None]:
module = cp.RawModule(code=source_str)
elementwise_kernel = module.get_function("elementwise")

Define an output array on the GPU

In [None]:
y_gpu_2 = cp.zeros_like(x_gpu)

To call the kernel function we define the thread block size and the number of blocks.

In [None]:
blocksize = 1
n_blocks = int(np.ceil(len(x_cpu) / blocksize))  # grid has len(x) threads grouped into blocks
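
To make the grid-size arithmetic concrete, here is a CPU-only sketch that emulates the launch with plain loops. It is an illustration under stated assumptions, not GPU code: the names `launch_shape` and `simulate_guarded_launch` are hypothetical, and the `i < n` guard is an addition that becomes necessary whenever the block size does not divide the array length, because the last block then contains threads past the end of the array.

```python
import numpy as np


def launch_shape(n, blocksize):
    """Return (number of blocks, total threads) for n elements."""
    n_blocks = (n + blocksize - 1) // blocksize  # integer ceiling division
    return n_blocks, n_blocks * blocksize


def simulate_guarded_launch(x, blocksize):
    """Emulate a 1D launch on the CPU, one loop iteration per thread."""
    n = len(x)
    n_blocks, _ = launch_shape(n, blocksize)
    y = np.zeros_like(x)
    for block in range(n_blocks):
        for thread in range(blocksize):
            i = block * blocksize + thread   # same index formula as in a kernel
            if i < n:                        # guard: surplus threads do nothing
                y[i] = 2 * np.sin(x[i]) + np.exp(x[i])
    return y
```

For 10 elements, `launch_shape(10, 1)` gives 10 blocks and 10 threads, while `launch_shape(10, 4)` gives 3 blocks and 12 threads, so two threads fall outside the array and rely on the guard.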

In [None]:
elementwise_kernel(grid=(n_blocks,), block=(blocksize,), args=(x_gpu, y_gpu_2))

Check that the two outputs (using the API and the C kernel) are the same.

In [None]:
cp.allclose(y_gpu, y_gpu_2)

#### Profiling

In [None]:
from cupyx.profiler import benchmark

In [None]:
def elementwise(n_blocks, blocksize, x, y):
    elementwise_kernel(grid=(n_blocks,), block=(blocksize,), args=(x, y))

In [None]:
# the kernel has no bounds check, so blocksize must divide len(x_cpu) exactly
blocksize = 5
n_blocks = int(np.ceil(len(x_cpu) / blocksize))
print(benchmark(elementwise, (n_blocks, blocksize, x_gpu, y_gpu_2), n_repeat=100))

In [None]:
import numpy as np
from time import time
import pyopencl as cl
import pyopencl.array as cl_array
import pyopencl.characterize.performance as perf

In [None]:
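src_twice = r"""
__kernel void mul(
    __global const double* x, 
    __global double* y)
{
    int i = get_global_id(0);  // one work-item per array element
    int j;
    // the same write is repeated 1000 times, presumably to lengthen the kernel for timing
    for(j=0; j<1000; j++)
    { 
        y[i] = 2 * x[i];
    }    
}"""      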

In [None]:
x = np.random.rand(100)  

In [None]:
cl.get_platforms()

In [None]:
ctx = cl.create_some_context(interactive=False)

In [None]:
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

In [None]:
prg = cl.Program(ctx,src_twice).build()
x_buffer = cl.Buffer(ctx, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=x)
y_buffer = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, x.nbytes)

In [None]:
grid_size = len(x)
workgroup_size = 1 

In [None]:
event = prg.mul(queue, (grid_size,), (workgroup_size,), x_buffer, y_buffer)  # the kernel function is named "mul"
event.wait()  # synchronize
elapsed = (event.profile.end - event.profile.start)*1e-3 # convert from [ns] to [us]
print(f"GPU kernel time: {int(elapsed)} us")
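
The introduction also lists PyOpenCL's array API as a way to run an elementwise operation. A minimal sketch of that route is given below; it is an illustration, not part of the original notebook. The function names `reference` and `elementwise_array_api` are hypothetical, and the sketch assumes an OpenCL device is available and, for `float64` input, that the device supports double precision.

```python
import numpy as np


def reference(x):
    """NumPy version of the elementwise operation used in this exercise."""
    return 2 * np.sin(x) + np.exp(x)


def elementwise_array_api(x_host):
    """Compute 2*sin(x) + exp(x) on an OpenCL device via pyopencl.array."""
    # Imported here so that reference() stays usable without an OpenCL runtime.
    import pyopencl as cl
    import pyopencl.array as cl_array
    import pyopencl.clmath as clmath

    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)
    x_dev = cl_array.to_device(queue, x_host)           # host -> device copy
    y_dev = 2 * clmath.sin(x_dev) + clmath.exp(x_dev)   # evaluated on the device
    return y_dev.get()                                  # device -> host copy
```

With `x = np.random.rand(100)`, the result can be checked on the host with `np.allclose(elementwise_array_api(x), reference(x))`.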