<a href="https://colab.research.google.com/github/Kusla75/parallel-programming-workshop/blob/master/03_runtime.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Runtime and basic example demonstration

This example showcases a simple kernel program together with the runtime (host) code that drives it.<br>
It shows how to send a float32 array to the GPU and compute $\sin(x)$ of each element in parallel.



In [None]:
!pip install pyopencl

## Kernel code

```c
__kernel void sinus(__global float* a) {
    int i = get_global_id(0);
    a[i] = sin(a[i]);
}
```

- `__kernel` - qualifier declaring that a function can be executed as a kernel on an OpenCL device when invoked by the host
- `__global` - indicates that the memory object the pointer refers to is allocated in the global memory space
- `get_global_id(0)` - returns the global ID of the current work-item, which is unique among the global work-items launched to execute the kernel. The argument selects the dimension, e.g. dimension 0
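
To make the role of the global ID more concrete, here is a rough pure-Python emulation of the kernel. It is only a sketch: the function `sinus` and the sequential loop are stand-ins introduced for illustration, whereas on a real device the work-items run in parallel.

```python
import math

def sinus(a, global_id):
    """Python stand-in for the kernel body: one work-item handles one element."""
    i = global_id              # plays the role of get_global_id(0)
    a[i] = math.sin(a[i])

data = [0.0, 0.5, 1.0, 1.5]
global_work_size = len(data)   # one work-item per element

# On a device the work-items execute in parallel; here we just loop over their IDs.
for global_id in range(global_work_size):
    sinus(data, global_id)

print(data)
```

Each call touches only the element at its own index, which is why the work-items can run independently of each other.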

By convention, files that contain kernel programs have the `.cl` extension.<br>
To run this example, a `.cl` file containing the kernel code above must be created and stored in the working directory of the VM instance.

In [2]:
!touch program.cl
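
Note that `touch` only creates an empty file, so the kernel source still has to be placed into `program.cl`. One possible way, shown here as an assumption rather than as the notebook's own method, is to write the file from Python:

```python
from pathlib import Path

# Kernel source copied from the "Kernel code" section.
kernel_src = """\
__kernel void sinus(__global float* a) {
    int i = get_global_id(0);
    a[i] = sin(a[i]);
}
"""

# Write the source to program.cl in the current working directory.
Path("program.cl").write_text(kernel_src)

# Read it back to confirm the file is no longer empty.
print(Path("program.cl").read_text())
```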

## Runtime code

In [67]:
import numpy as np
import pyopencl as cl

np.random.seed(0)       # seed the random number generator for reproducibility

Using NumPy, we generate the array that will be sent to the device for processing.<br>
The `np.random.randn()` function draws elements from the standard normal distribution, and we convert them to float32.

In [85]:
size_of_array = 100
array = np.random.randn(size_of_array).astype(np.float32)
print(array)

result_cpu = np.sin(array)
print(result_cpu)

[ 1.88315070e+00 -1.34775901e+00 -1.27048504e+00  9.69396710e-01
 -1.17312336e+00  1.94362116e+00 -4.13618982e-01 -7.47454822e-01
  1.92294204e+00  1.48051476e+00  1.86755896e+00  9.06044662e-01
 -8.61225665e-01  1.91006494e+00 -2.68003374e-01  8.02456379e-01
  9.47251976e-01 -1.55010089e-01  6.14079356e-01  9.22206700e-01
  3.76425534e-01 -1.09940076e+00  2.98238188e-01  1.32638586e+00
 -6.94567859e-01 -1.49634540e-01 -4.35153544e-01  1.84926379e+00
  6.72294736e-01  4.07461822e-01 -7.69916058e-01  5.39249182e-01
 -6.74332678e-01  3.18305567e-02 -6.35846078e-01  6.76433265e-01
  5.76590836e-01 -2.08298758e-01  3.96006703e-01 -1.09306157e+00
 -1.49125755e+00  4.39391702e-01  1.66673496e-01  6.35031462e-01
  2.38314486e+00  9.44479465e-01 -9.12822247e-01  1.11701632e+00
 -1.31590736e+00 -4.61584598e-01 -6.82416037e-02  1.71334267e+00
 -7.44754851e-01 -8.26438546e-01 -9.84525234e-02 -6.63478315e-01
  1.12663591e+00 -1.07993150e+00 -1.14746869e+00 -4.37820047e-01
 -4.98032451e-01  1.92953

First we need to create a context and assign a platform object and device objects to it. The available devices can be retrieved from the platform object; note that the first device returned is not guaranteed to be a GPU. Additionally, every program, buffer and command queue object must be associated with a context.

In [86]:
platform = cl.get_platforms()[0]
gpu_device = platform.get_devices()[0]

context = cl.Context(
    devices=[gpu_device],
    properties=[(cl.context_properties.PLATFORM, platform)]
)

Each device needs its own command queue. The command queue is used to launch kernels and to send data to and retrieve data from the device.

In [87]:
queue = cl.CommandQueue(context, gpu_device)

Next we create a buffer object, which is a reference to data stored on the device. At this point we only specify its size (`array.nbytes` bytes); the data itself is copied in a later step.

In [88]:
buffer = cl.Buffer(context, cl.mem_flags.READ_WRITE, array.nbytes)
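
The buffer size passed above is given in bytes, not in elements. As a small sketch of where that number comes from (using a zero-filled stand-in array rather than the random one from the notebook):

```python
import numpy as np

size_of_array = 100
array = np.zeros(size_of_array, dtype=np.float32)

# Each float32 element occupies 4 bytes, so 100 elements need 400 bytes.
print(array.itemsize)   # bytes per element
print(array.nbytes)     # total bytes = number of elements * itemsize
```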

Now we load the program file that contains the kernel function, create a program object and compile it. After that we set the kernel's arguments by passing the buffers that hold the data.

In [89]:
with open("program.cl", "r") as program_file:
    program_src = program_file.read()

program = cl.Program(context, program_src)
program.build()                              # compile the program for the context's devices
kernel = program.sinus                       # look up the kernel object by its function name

kernel.set_args(buffer)

Next we copy the data from host memory into the device memory that the buffer object refers to.

In [90]:
cl.enqueue_copy(queue, buffer, array)

<pyopencl._cl.NannyEvent at 0x7feca01d0770>

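Finally, we enqueue the kernel on the device, which launches the work-items that each run the kernel function.<br>
We launch as many work-items as there are elements in the array. Passing `None` as the local work size lets the OpenCL implementation choose the work-group size.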

In [91]:
global_work_size = (size_of_array,)
cl.enqueue_nd_range_kernel(queue, kernel, global_work_size, None)

<pyopencl._cl.Event at 0x7feca01d0b30>

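Once the kernel has run, we need to copy the data back from device memory to the host. We create an empty result array and copy the results into it; because the commands in this queue execute in order, the copy runs only after the kernel has finished.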

In [94]:
result_gpu = np.empty(size_of_array, dtype=np.float32)
cl.enqueue_copy(queue, result_gpu, buffer)

print(result_gpu)

[ 0.9516127  -0.9752301  -0.9552444   0.82454455 -0.9219647   0.93130213
 -0.40192574 -0.6797743   0.9386348   0.9959274   0.9562882   0.78707004
 -0.75864166  0.94299835 -0.26480663  0.7190653   0.81181395 -0.15439007
  0.5762063   0.7969366   0.36759862 -0.89093536  0.29383662  0.9702802
 -0.64005345 -0.14907676 -0.42154965  0.9614778   0.622783    0.39628023
 -0.69607496  0.51349187 -0.6243762   0.03182518 -0.5938585   0.6260156
  0.5451691  -0.20679574  0.38573718 -0.8880387  -0.99683845  0.42538902
  0.16590287  0.5932028   0.6877955   0.8101919  -0.79123276  0.8987965
 -0.9676913  -0.44536743 -0.06818865  0.98985744 -0.6777916  -0.73552316
 -0.09829355 -0.615861    0.90297174 -0.8819255  -0.911727   -0.42396614
 -0.4776979   0.93634146  0.81307846  0.08743944 -0.9409534   0.74754816
 -0.8415873  -0.9996614   0.9276349   0.3116629   0.7961216   0.31335855
  0.7557709  -0.60600257 -0.85947555  0.63003206 -0.71972746 -0.6361899
 -0.43994057  0.01747827 -0.34664685 -0.9808836  -0.600

## Let's check if the results are valid

We repeat the same calculation on the CPU and check whether the CPU and GPU results are approximately equal.

In [95]:
result_cpu = np.sin(array)

are_same = np.allclose(result_cpu, result_gpu)
print("Results from CPU and GPU are same: ", are_same)

Results from CPU and GPU are same:  True
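
`np.allclose(a, b)` compares with a tolerance instead of requiring exact equality: with the default settings it checks `|a - b| <= atol + rtol * |b|` element-wise, where `rtol=1e-05` and `atol=1e-08`. The sketch below illustrates why a tolerance is appropriate for float32 results. It uses no GPU: a float64 computation stands in as the reference, and the names `reference` and `single` are chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100).astype(np.float32)

reference = np.sin(x.astype(np.float64))   # double-precision reference
single = np.sin(x)                         # float32 result, same type as the kernel output

# Largest rounding difference between the two results.
max_abs_error = np.max(np.abs(single - reference))
print("max abs error:", max_abs_error)

# Exact comparison is too strict for rounded float32 values;
# the tolerance-based check is the meaningful one.
print("exactly equal:      ", np.array_equal(single, reference))
print("approximately equal:", np.allclose(single, reference))
```

Small differences of this kind are also expected between a CPU and a GPU, since the two may use different implementations of `sin`.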
