# Demo 2: An Introduction to PyOpenCL
## EECS E4750: Heterogeneous Computing for Signal and Data Computing
Fall 2020

## Preface

Welcome to E4750! In this Jupyter Notebook, you will explore what may very well be your first encounter with OpenCL programming, through the popular Python wrapper - **PyOpenCL**.

Although this demo is presented as a Jupyter Notebook, please keep in mind that assignments and future recitations will not use notebooks: we will work exclusively with executable Python scripts. Nevertheless, this demo and the PyCUDA demo will be available on the course repo if you need to revisit either.

## About OpenCL

Unlike CUDA, which is a proprietary API, OpenCL is a standard aimed broadly at every possible device/platform. As a result, multiple *implementations* of the OpenCL standard exist, provided by device vendors to support their platforms. Intel has its own implementation, as did AMD for its CPUs/APUs. (Unfortunately, the highly anticipated Ryzen series doesn't seem to have OpenCL support, which is a real shame. The Radeon graphics products do still support OpenCL through ROCm, though.) Nvidia, too, ships OpenCL support with its CUDA toolkit.

While originally created by Apple, the standard has long since been maintained by the Khronos Group, a non-profit consortium that includes most of the big names in the computing world, along with several academic institutions. Apple, meanwhile, has shifted to an in-house proprietary API called Metal. Do not be fooled into thinking that OpenCL is therefore irrelevant, though! While it is true that resources for learning OpenCL are harder to find than for CUDA, being an open standard has its advantages.


## PyOpenCL

Just like PyCUDA, PyOpenCL is a Python wrapper for OpenCL, with the same scope (full support) and similar usage.

## Platform, Device and Context

Let's begin by exploring the OpenCL platform and the Context.

In [None]:
import pyopencl as cl

# Create a context
ctx = cl.create_some_context()

# Query and print available device(s)
device = ctx.devices[0]
print('Device name: ', device.name)

# Query the platform that owns this device and print its name
# (platforms[0] is not necessarily the platform of the chosen device)
platform = device.platform
print('OpenCL device/platform name:', platform.name)

# Device Global Memory
print('Global Memory Size: ', device.global_mem_size//1024**2, ' megabytes')

Device name:  GeForce RTX 2070
OpenCL device/platform name: NVIDIA CUDA
Global Memory Size:  7982  megabytes


So the platform refers to the vendor-specific OpenCL implementation, while a context is what the OpenCL runtime uses to manage objects such as command queues (the objects that allow you to send commands to a device), as well as memory, program, and kernel objects. A context facilitates kernel execution on one or more devices specified in that context.
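
To make the platform/device relationship concrete, here is a minimal sketch that lists every visible platform together with its devices. The helper name `list_opencl_devices` is hypothetical, and the guarded import is an assumption added only so the snippet still runs (returning an empty list) on a machine without PyOpenCL or an OpenCL driver.

```python
try:
    import pyopencl as cl
except ImportError:  # PyOpenCL is not installed on this machine
    cl = None


def list_opencl_devices():
    """Return (platform name, device name) pairs for every visible device.

    Hypothetical helper for illustration; returns [] when no OpenCL
    runtime can be found.
    """
    if cl is None:
        return []
    pairs = []
    try:
        for platform in cl.get_platforms():
            for device in platform.get_devices():
                pairs.append((platform.name, device.name))
    except cl.Error:  # e.g. no OpenCL driver (ICD) is installed
        return []
    return pairs


for platform_name, device_name in list_opencl_devices():
    print(platform_name, '->', device_name)
```

On a machine with several vendor implementations installed, each one shows up as a separate platform, which is why the platform name above reads `NVIDIA CUDA` rather than the name of the card.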

Let's now check the number of compute units, which on Nvidia GPUs correspond to Streaming Multiprocessors (SMs):

In [None]:
print('SM count: ', device.max_compute_units)


SM count:  36


### CUDA cores vs. Streaming Multiprocessors

SMs and CUDA cores are not the same. CUDA cores are the figure more prominently advertised, so they are probably what you are familiar with. The GPU architecture is composed of a certain number of Streaming Multiprocessors (SMs). In the case of the card in use above, there are 36 SMs. For this specific device architecture, Nvidia includes 64 CUDA cores per SM. While you can easily Google these details for the Nvidia GPU that you'll use in a Google Cloud VM, it is a good exercise to run these device queries and discover the details yourself.
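
As a quick worked example, the total CUDA core count follows from the two figures quoted above. This is a sketch using the numbers from the text; the cores-per-SM value is hard-coded here as an assumption taken from the paragraph, not read from a device query.

```python
# Figures quoted in the text for the GeForce RTX 2070 used above.
sm_count = 36        # value printed by device.max_compute_units
cores_per_sm = 64    # architecture-specific figure taken from the text

total_cuda_cores = sm_count * cores_per_sm
print('Total CUDA cores:', total_cuda_cores)   # Total CUDA cores: 2304
```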

## Demo PyOpenCL Program

A basic OpenCL program structure has 3 levels:
* Platform
    * query platform
    * query compute devices
    * create contexts

* Runtime
    * create memory objects associated with the contexts
    * compile and create kernel program objects
    * issue commands to command queue
    * synchronization of commands
    * clean up OpenCL resources

* Kernels/Language layer
    * OpenCL C Kernel Code


Let's look at an example similar to the PyCUDA demo: doubling an array (here, a 1D array of five elements).

In [None]:
import numpy as np

a_np = np.random.rand(5).astype(np.float32)

Once again, the input is cast to `float32`: it must match the single-precision `float` type that the kernel operates on.

Next, create a context and a command queue.

### Command Queue

The command queue highlights an advantage that OpenCL holds over CUDA: an OpenCL context can span multiple devices. Each command queue is tied to one device in your context (that is, the group of devices you have chosen to use), so the queue you submit a command to determines which device executes it, and the queue's properties determine how it is executed.

In [None]:
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
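
For a multi-device setup, a minimal sketch might build one context over all devices of a platform and then one queue per device. The helper name `make_queues_per_device` is hypothetical, picking the first platform is an arbitrary choice for illustration, and the guarded import is an assumption so the snippet also runs (returning an empty list) where no OpenCL runtime is available.

```python
try:
    import pyopencl as cl
except ImportError:  # PyOpenCL is not installed on this machine
    cl = None


def make_queues_per_device():
    """Return a list of (device, queue) pairs sharing one context.

    Hypothetical helper for illustration; returns [] when no usable
    OpenCL platform or device is found.
    """
    if cl is None:
        return []
    try:
        platform = cl.get_platforms()[0]       # first platform, arbitrarily
        devices = platform.get_devices()
        ctx = cl.Context(devices=devices)      # one context, many devices
        # Each queue is bound to exactly one device of the shared context.
        return [(dev, cl.CommandQueue(ctx, device=dev)) for dev in devices]
    except (cl.Error, IndexError):             # no driver or no platform
        return []


for dev, dev_queue in make_queues_per_device():
    print('Queue ready for device:', dev.name)
```

Enqueueing a command on a given queue is then what selects the device that runs it.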

The next step is to write the input array to a buffer object in device memory. You can read more about memory interactions in the [PyOpenCL documentation](https://documen.tician.de/pyopencl/runtime_memory.html). With the command below, you write the input array `a_np` into an on-device buffer using the memory flag `cl.mem_flags.COPY_HOST_PTR`, which tells OpenCL to copy the values at the given host address into device memory.

In [None]:
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)

### Kernel Code

You can define the kernel code as a string and then build it using `cl.Program`. Below, the kernel code performs array doubling like in the PyCUDA tutorial.

In [None]:
prg = cl.Program(ctx, """
__kernel void doublify(
    __global const float *a_g, __global float *res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid]*2;
}
""").build()

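To see what `get_global_id(0)` does in the kernel, it can help to emulate the launch on the host. The following is a plain-Python sketch, not PyOpenCL code: the function names are hypothetical, and the loop stands in for work-items that a real device would run in parallel.

```python
def doublify_work_item(gid, a, res):
    """Body of the kernel for the single work-item with global ID `gid`."""
    res[gid] = a[gid] * 2


def run_emulated(a):
    """Sequentially emulate launching one work-item per element of `a`."""
    res = [0.0] * len(a)
    for gid in range(len(a)):   # on the device these iterations run in parallel
        doublify_work_item(gid, a, res)
    return res


print(run_emulated([0.5, 1.25, 3.0]))   # [1.0, 2.5, 6.0]
```

Each work-item reads and writes only the element at its own global ID, which is why no synchronization between work-items is needed here.
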
The next step is to call the built kernel function with the parameters it needs to run. You also need an additional buffer to write the result into, which is created with the `WRITE_ONLY` flag. The function call requires you to specify which queue to use; since we've defined one context and a corresponding command queue, that queue goes here. The second argument (`a_np.shape`) is the global work size, i.e. one work-item per array element, and `None` leaves the choice of work-group size to the OpenCL implementation.

In [None]:
res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)
prg.doublify(queue, a_np.shape, None, a_g, res_g)

<pyopencl._cl.Event at 0x7fc9ca1e5f90>

### Copy result back to Host

Finally, you have to copy the result from the device memory buffer back to host, which is of course done through the same command queue.

In [None]:
res_np = np.empty_like(a_np)
cl.enqueue_copy(queue, res_np, res_g)

# Check on CPU with Numpy:
print(res_np)
print(a_np*2)

[1.1207193e+00 9.2828828e-01 1.7077771e+00 7.5119704e-01 1.4687362e-03]
[1.1207193e+00 9.2828828e-01 1.7077771e+00 7.5119704e-01 1.4687362e-03]
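
Since assignments use standalone scripts rather than notebooks, the steps above can be collected into a single file. This is a minimal sketch under a few assumptions: the function name `double_on_device` is hypothetical, `interactive=False` is passed so that `create_some_context` does not prompt for a device, and the guarded import with a `None` return is there only so the script still runs on a machine without an OpenCL runtime.

```python
try:
    import numpy as np
    import pyopencl as cl
except ImportError:  # NumPy or PyOpenCL is not installed
    cl = None

KERNEL_SRC = """
__kernel void doublify(
    __global const float *a_g, __global float *res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid]*2;
}
"""


def double_on_device(values):
    """Double `values` on an OpenCL device; return None if none is usable."""
    if cl is None:
        return None
    try:
        # Platform layer: pick a device and create a context and queue
        ctx = cl.create_some_context(interactive=False)
        queue = cl.CommandQueue(ctx)

        # Runtime layer: buffers, program build, kernel launch, copy back
        a_np = np.asarray(values, dtype=np.float32)
        mf = cl.mem_flags
        a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
        res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)
        prg = cl.Program(ctx, KERNEL_SRC).build()
        prg.doublify(queue, a_np.shape, None, a_g, res_g)

        res_np = np.empty_like(a_np)
        cl.enqueue_copy(queue, res_np, res_g)
        return res_np
    except cl.Error:  # no driver, no device, or a build/launch failure
        return None


if __name__ == '__main__':
    print(double_on_device([1.0, 2.0, 3.0]))
```

Swallowing every `cl.Error` keeps the sketch portable; in an assignment you would more likely let the exception propagate so that build errors stay visible.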


### Error Handling with `try/except` in Python

Here's an example of how to check for errors and handle a specific exception. We'll use `try` and `except` to check whether the results from the OpenCL kernel execution match standard NumPy doubling.

In [None]:
# Error check
try:
    print("Checkpoint: Do python and opencl result match? Checking...")
    # compare the device result against the NumPy reference
    assert np.allclose(res_np, a_np*2)
except AssertionError:
    baddouble = True    # results disagree
    print("Checkpoint failed: Python and opencl kernel result do not match. Try Again!")

    # perform some additional operation if assertion error occurs
    # operation in case of exception goes here
else:
    # this branch runs only if no exception was raised, i.e. the check passed
    baddouble = False   # results agree mutually
    print('Results match!')

Checkpoint: Do python and opencl result match? Checking...
Results match!
