# Exercise 2b - PyOpenCL demo
The goal of this demo is to show the basic usage of PyOpenCL.<br>
We will perform a simple elementwise operation on the GPU in two ways: 
- using PyOpenCL's array API
- manually by building and calling a C kernel.

In [None]:
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

In PyOpenCL we have to define our context manually. In the simplest case this means we select a platform, then its devices, then initialize a context on them.

In [None]:
platforms_list = cl.get_platforms()
platforms_list

In [None]:
devices_list = platforms_list[0].get_devices()
devices_list

In [None]:
# the total amount of global memory on the first device, in bytes
devices_list[0].global_mem_size

In [None]:
context = cl.Context(devices=devices_list)
context

In [None]:
queue = cl.CommandQueue(context)
queue

Initialize an input array on the CPU

In [None]:
x_cpu = np.random.rand(int(1e6))
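
As a quick check of what will be transferred, the following minimal sketch inspects the element type and memory footprint of such an array. `np.random.rand` returns double-precision values, which is why the C kernel later in this demo declares its arguments as `double`. The name `x_demo` is only illustrative and is not used elsewhere in the notebook.

```python
import numpy as np

# Same construction as x_cpu above; x_demo is an illustrative name.
x_demo = np.random.rand(int(1e6))

print(x_demo.dtype)   # float64, i.e. a C double
print(x_demo.nbytes)  # 8000000 bytes = 1e6 elements * 8 bytes each
```

The footprint can be compared with `devices_list[0].global_mem_size` to confirm that the array fits on the device.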

Transfer this array to the GPU. Pass the command queue as the first argument.

In [None]:
x_gpu = cl_array.to_device(queue, x_cpu)

PyOpenCL arrays work like NumPy arrays

In [None]:
print(type(x_cpu), type(x_gpu))

In [None]:
x_gpu.shape

You can find the PyOpenCL equivalent of each NumPy math operation [here](https://documen.tician.de/pyopencl/array.html).

In [None]:
import pyopencl.clmath as clmath

An elementwise operation

In [None]:
y_gpu = 2 * clmath.sin(x_gpu) + clmath.exp(x_gpu)
print(type(y_gpu))

A reduction operation.

In [None]:
z_cpu = np.sum(x_cpu)
z_gpu = cl_array.sum(x_gpu)
print(z_cpu, type(z_cpu), z_gpu, type(z_gpu))

Transfer data back to the CPU

In [None]:
y_cpu = y_gpu.get()
z_cpu = z_gpu.get()
print(type(y_cpu), type(z_cpu))

We can do the same using manually defined low-level C kernels.

__TODO:__ try to understand each line of the kernel, especially the indexing of the threads: `int i = get_global_id(0);`. How does it differ from the CUDA kernel in Exercise 2a? Try to rewrite this line using OpenCL's local thread indexers: `get_group_id(0)`, `get_local_size(0)` and `get_local_id(0)`.

In [None]:
source_str = r"""
__kernel
void elementwise(
    __global const double* x, 
    __global double* y)
{
    int i = get_global_id(0);
    y[i] = 2 * sin(x[i]) + exp(x[i]);
}
"""

Build and load the kernel function

In [None]:
prg = cl.Program(context, source_str).build()
elementwise_kernel = prg.elementwise
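
The kernel uses `double`, which is an optional feature in OpenCL, so the build may fail on devices without double-precision support. Devices that support it usually list the `cl_khr_fp64` extension. The sketch below checks an extension string for that name. The helper `supports_fp64` is hypothetical (not part of PyOpenCL), and the example strings are made up rather than taken from a real device.

```python
def supports_fp64(extensions):
    """Return True if a space-separated OpenCL extension string lists cl_khr_fp64."""
    return "cl_khr_fp64" in extensions.split()

# With the device list from this notebook one would call, for example:
#     supports_fp64(devices_list[0].extensions)
# The strings below are made-up examples, not output from a real device.
print(supports_fp64("cl_khr_fp64 cl_khr_global_int32_base_atomics"))  # True
print(supports_fp64("cl_khr_global_int32_base_atomics"))              # False
```

If the check fails, one option is to switch both the NumPy array and the kernel arguments to single precision (`np.float32` and `float`).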

Define an output array on the GPU

In [None]:
y_gpu_2 = cl_array.zeros_like(x_gpu)

To call the kernel function, we define the global grid size (the total number of work-items) and the size of the workgroup (equivalent to a thread block in CUDA).
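
To make the relation between the two sizes concrete, here is a pure-Python sketch (no GPU needed) that lists, for a small one-dimensional range, which workgroup each work-item belongs to and its position inside that group. It assumes no global work offset, and `describe_work_items` is a hypothetical helper written only for this illustration.

```python
def describe_work_items(global_size, local_size):
    """Return (global_id, group_id, local_id) for every work-item in a 1D range."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    items = []
    for global_id in range(global_size):
        # Split the global index into a workgroup index and an index inside the group.
        group_id, local_id = divmod(global_id, local_size)
        items.append((global_id, group_id, local_id))
    return items

# 8 work-items in workgroups of 4 give 2 workgroups.
for item in describe_work_items(global_size=8, local_size=4):
    print(item)
```

With these values, work-items 0 to 3 fall into group 0 and work-items 4 to 7 into group 1, each with local indices 0 to 3.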

In [None]:
grid_size = len(x_cpu)
workgroup_size = 1
elementwise_kernel(queue, (grid_size,), (workgroup_size,), x_gpu.data, y_gpu_2.data)
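
With `workgroup_size = 1` any grid size is valid. In OpenCL 1.x, however, the global size must be a multiple of the workgroup size whenever the latter is given explicitly. A common approach, sketched below, is to round the global size up to the next multiple. The helper name `round_up_global_size` is an assumption for this example. Note that the kernel in this demo has no bounds check, so using a padded grid would also require passing the array length to the kernel and guarding the body with something like `if (i < n)`.

```python
def round_up_global_size(n, workgroup_size):
    """Return the smallest multiple of workgroup_size that is >= n."""
    return ((n + workgroup_size - 1) // workgroup_size) * workgroup_size

print(round_up_global_size(1_000_000, 64))   # 1000000 (already a multiple of 64)
print(round_up_global_size(1_000_000, 256))  # 1000192
```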

Check that the two outputs (using the API and the C kernel) are the same.

In [None]:
y_cpu_2 = y_gpu_2.get()

In [None]:
np.allclose(y_cpu, y_cpu_2)
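
`np.allclose` compares with a tolerance (by default `rtol=1e-05` and `atol=1e-08`) rather than bit for bit, which suits results computed by different math implementations. The sketch below illustrates this on the CPU only: a single-precision NumPy computation stands in for a second implementation. It is not output from the GPU, and the array names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1000)

# Double-precision reference of the same elementwise formula.
reference = 2 * np.sin(x) + np.exp(x)

# Single-precision stand-in for a second, slightly different implementation.
x32 = x.astype(np.float32)
approx = 2 * np.sin(x32) + np.exp(x32)

print(np.array_equal(reference, approx))    # exact comparison
print(np.allclose(reference, approx))       # comparison within tolerance
print(np.max(np.abs(reference - approx)))   # largest absolute deviation
```

Printing the largest absolute deviation alongside the boolean result helps to judge how close two outputs really are.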

## Profiling

To profile PyOpenCL code, we have to enable profiling as a property when creating the command queue.

In [None]:
queue = cl.CommandQueue(context, properties=cl.command_queue_properties.PROFILING_ENABLE)

Rebuild the program and create the input and output buffers on the new queue

In [None]:
prg = cl.Program(context, source_str).build()
x_buffer = cl_array.to_device(queue, x_cpu)
y_buffer = cl_array.zeros_like(x_buffer)

Set the grid and workgroup size

In [None]:
grid_size = len(x_cpu)
workgroup_size = 1

To profile the kernel execution, we keep the event returned by the kernel call (an OpenCL object that tracks the state of an enqueued command). Then we call `event.wait()` to block the host until the command has finished. Finally, the timing information can be retrieved from the event object via `event.profile.start` and `event.profile.end`, which contain device timestamps in nanoseconds.

__TODO:__ try to tune the grid and workgroup size to make the code run as fast as possible. (You might see different runtimes compared to the CuPy exercise, because the CUDA and OpenCL implementations of the math operations `sin` and `exp` differ in execution time.)

In [None]:
event = prg.elementwise(queue, (grid_size,), (workgroup_size,), x_buffer.data, y_buffer.data)
event.wait()  # synchronize
elapsed = (event.profile.end - event.profile.start)*1e-3 # convert from [ns] to [us]
print(f"GPU kernel time: {int(elapsed)} us")
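
When comparing different grid and workgroup sizes, it can help to convert the raw timestamps into an effective memory bandwidth as well. The sketch below assumes that each element is read once and written once, ignoring caching effects. The helper `kernel_stats` and the timestamps in the example call are made up for illustration and are not measurements.

```python
def kernel_stats(start_ns, end_ns, n_elements, bytes_per_element=8):
    """Return (elapsed time in us, effective bandwidth in GB/s) for an elementwise kernel."""
    elapsed_ns = end_ns - start_ns
    elapsed_us = elapsed_ns / 1e3
    # Assumption: one read of x[i] and one write of y[i] per element.
    bytes_moved = 2 * n_elements * bytes_per_element
    # Bytes per nanosecond equals 1e9 bytes per second, i.e. GB/s.
    bandwidth_gb_s = bytes_moved / elapsed_ns
    return elapsed_us, bandwidth_gb_s

# Made-up timestamps; in the notebook one would pass
# event.profile.start, event.profile.end and len(x_cpu).
elapsed_us, gb_s = kernel_stats(start_ns=1_000_000, end_ns=1_500_000, n_elements=1_000_000)
print(f"{elapsed_us:.0f} us, {gb_s:.1f} GB/s")  # 500 us, 32.0 GB/s
```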