# Runtime and basic example demonstration

This example showcases a simple kernel program together with the host-side runtime code.<br>
It sends a float32 array to the GPU and computes $\sin(x)$ of every element in parallel.



In [None]:
!pip install pyopencl

---

## Kernel code

In [None]:
%%writefile program.cl

__kernel void sinus(__global float* a) {
    int i = get_global_id(0);
    a[i] = sin(a[i]);
}

- `__kernel` - qualifier declaring that a function can be executed as a kernel on an OpenCL device when called by the host
- `__global` - indicates that the memory object is allocated in the global memory space
- `get_global_id(0)` - returns the global ID of the current work-item, which depends on the number of global work-items specified to execute the kernel. The argument selects the dimension, e.g. `0` for the first dimension
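
To make the role of `get_global_id(0)` concrete, here is a minimal host-side sketch in plain Python that emulates a one-dimensional launch. The names `sinus_work_item` and `run_over_global_range` are hypothetical and exist only for this illustration; they are not part of OpenCL or PyOpenCL, and a real device runs the work-items in parallel rather than in a loop.

```python
import math

def sinus_work_item(a, global_id):
    """Body of the `sinus` kernel for a single work-item (illustrative stand-in)."""
    # In the real kernel, global_id is what get_global_id(0) returns.
    a[global_id] = math.sin(a[global_id])

def run_over_global_range(work_item_fn, data, global_size):
    """Emulate a 1-D launch by visiting every global ID in turn.

    A device would execute these calls in parallel and in no fixed order.
    """
    for global_id in range(global_size):
        work_item_fn(data, global_id)

values = [0.0, math.pi / 2, math.pi]
run_over_global_range(sinus_work_item, values, len(values))
print(values)
```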

By convention, files that contain kernel programs have the `.cl` extension.<br>
To run this example, the `.cl` program must be created and stored in the local directory of the VM instance.
The `%%writefile program.cl` command creates the file and writes the contents of the cell to it.

## Runtime code

In [None]:
import numpy as np
import pyopencl as cl

np.random.seed(0)       # initializing seed for random function

Using NumPy, we generate the array that will be sent to the device for processing.<br>
The `np.random.randn()` function draws samples from the standard normal distribution, and we convert them to float32.

In [None]:
size_of_array = 100
array = np.random.randn(size_of_array).astype(np.float32)
print(array)

First we need to create a context and assign the platform and device objects to it. The available devices can be retrieved from the platform object. Additionally, every program, buffer and command queue object must be assigned to a context.

In [None]:
platform = cl.get_platforms()[0]
gpu_device = platform.get_devices()[0]

context = cl.Context(
    devices=[gpu_device],
    properties=[(cl.context_properties.PLATFORM, platform)]
)

Each device needs its own command queue. The command queue is used to launch kernel execution and to send data to and retrieve data from the device.

In [None]:
queue = cl.CommandQueue(context, gpu_device)

Next we create a buffer object that serves as a reference to the data stored on the device.

In [None]:
buffer = cl.Buffer(context, cl.mem_flags.READ_WRITE, array.nbytes)
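
The buffer size passed above is `array.nbytes`, which for a NumPy array is the element count multiplied by the size of one element. The sketch below reproduces that arithmetic with the standard library only; `buffer_size_in_bytes` is a hypothetical helper introduced for illustration.

```python
import struct

# Size of one IEEE 754 single-precision value, i.e. one float32 element.
FLOAT32_SIZE = struct.calcsize("<f")

def buffer_size_in_bytes(element_count, element_size=FLOAT32_SIZE):
    """Bytes needed to hold element_count elements of element_size bytes each."""
    return element_count * element_size

# 100 float32 elements, as in this notebook.
print(buffer_size_in_bytes(100))  # 400
```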

Now we need to load the program file that contains the kernel function, create a program object and compile the kernel. After that we set the kernel arguments by passing the buffers that hold the data.

In [None]:
# read the kernel source; the file is closed automatically
with open("program.cl", "r") as program_file:
    program_src = program_file.read()

program = cl.Program(context, program_src)
program.build()                              # here the compilation process happens
kernel = program.sinus                       # here we link our kernel object

kernel.set_args(buffer)

Next we copy the data from host memory to the block of device memory that the buffer object refers to.

In [None]:
cl.enqueue_copy(queue, buffer, array)

Finally, we enqueue the kernel on the device and launch the work-items, each of which runs its own instance of the kernel.<br>
We launch as many work-items as there are elements in the array.

In [None]:
global_work_size = (size_of_array,)
local_work_size = (size_of_array,)
cl.enqueue_nd_range_kernel(queue, kernel, global_work_size, local_work_size)
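
Here the global and local sizes are both 100, so the launch consists of a single work-group of 100 work-items. For larger arrays the local size is capped by the device's maximum work-group size, and in OpenCL 1.x the global size has to be a multiple of the local size. The sketch below shows the usual arithmetic; `round_up_global_size` and `decompose_global_id` are hypothetical helpers, and the local size of 64 is only an assumed example value.

```python
def round_up_global_size(element_count, local_size):
    """Smallest multiple of local_size that is >= element_count."""
    groups = (element_count + local_size - 1) // local_size
    return groups * local_size

def decompose_global_id(global_id, local_size):
    """Split a 1-D global ID into (work-group ID, local ID), assuming a zero global offset."""
    return divmod(global_id, local_size)

print(round_up_global_size(100, 64))  # 128
print(decompose_global_id(70, 64))    # (1, 6)
```

If the global size is padded like this, the kernel would also need a bounds check (for example, an extra length argument) so that the surplus work-items do not touch memory past the end of the array.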

Once the kernel has run, we need to copy the data from device memory back to the host. We create an empty result array and copy the results into it.

In [None]:
result_gpu = np.empty(size_of_array, dtype=np.float32)
cl.enqueue_copy(queue, result_gpu, buffer)

print(result_gpu)

# Let's check if results are valid

We repeat the same calculation on the CPU and check whether the CPU and GPU results are approximately equal.

In [None]:
result_cpu = np.sin(array)

are_same = np.allclose(result_cpu, result_gpu)
print("Results from CPU and GPU are same: ", are_same)
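
The comparison is approximate because single-precision results can differ in their last bits between implementations. As a rough sketch, the element-wise criterion documented for `np.allclose` is `|a - b| <= atol + rtol * |b|`, with defaults `rtol=1e-5` and `atol=1e-8`. The helpers `to_float32` and `is_close` below are hypothetical, standard-library stand-ins written only to illustrate that criterion.

```python
import math
import struct

def to_float32(x):
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def is_close(a, b, rtol=1e-5, atol=1e-8):
    """Scalar version of the criterion documented for np.allclose."""
    return abs(a - b) <= atol + rtol * abs(b)

x = 0.5
single = to_float32(math.sin(x))  # result rounded to single precision
double = math.sin(x)              # double-precision reference

print(abs(single - double))       # tiny rounding difference
print(is_close(single, double))   # within the default tolerances
print(is_close(1.0, 1.001))       # a genuinely different value is rejected
```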

# Some useful shortcuts

PyOpenCL can automatically create a context and assign devices from the available platforms.

In [None]:
ctx = cl.create_some_context()

When creating a buffer object, the data can be copied into it automatically by passing the memory flag `COPY_HOST_PTR` together with the `hostbuf` argument.

In [None]:
mf = cl.mem_flags
array_buffer = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=array)

Setting the kernel arguments and enqueueing the work-items can be done through a single kernel call.

In [None]:
kernel(queue, global_work_size, local_work_size, buffer)