## A taste of PyOpenCL

In this module we demonstrate how to run a small PyOpenCL example in an Azure Notebook. The example is based on the short example available at https://documen.tician.de/pyopencl/. As the platform we use [POCL](http://portablecl.org/), which provides an OpenCL-compliant CPU driver. First, we need to install PyOpenCL and POCL.

In [None]:
!conda install -c conda-forge --yes --verbose pyopencl pocl

We can now import pyopencl.

In [12]:
import pyopencl as cl

The following command lists the available OpenCL platforms.

In [13]:
for platform in cl.get_platforms():
    print(platform.name)

Apple


Only one platform is available. In the run recorded here it is reported as Apple; in an Azure Notebook with the packages installed above it should be POCL. Let us now create a context. If only one platform is present, the following command uses it automatically. Otherwise, the user needs to choose between the available platforms.
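When several platforms are installed, the choice can also be made in code rather than interactively. The following is a minimal sketch of selecting a platform by name. The helper `pick_platform_index` is hypothetical and not part of PyOpenCL, and the platform names are example values; the helper works on plain strings so that it runs without an OpenCL driver.

```python
def pick_platform_index(platform_names, preferred):
    """Return the index of the first name containing `preferred`.

    Hypothetical helper, not part of PyOpenCL. The match is
    case-insensitive; a ValueError is raised if nothing matches.
    """
    wanted = preferred.lower()
    for index, name in enumerate(platform_names):
        if wanted in name.lower():
            return index
    raise ValueError(f"no platform matching {preferred!r} in {platform_names!r}")


# Example names only (assumption). On a real system they would come from
# [p.name for p in cl.get_platforms()].
names = ["Apple", "Portable Computing Language"]
chosen = pick_platform_index(names, "portable")
print(chosen, names[chosen])
```

With the index in hand, `platform = cl.get_platforms()[chosen]` followed by `ctx = cl.Context(devices=platform.get_devices())` would create a context for that platform without a prompt.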

In [14]:
ctx = cl.create_some_context()

We can now check the available devices within the context.

In [15]:
for dev in ctx.devices:
    print(dev.name)

Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz


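The device name depends on the machine. The run recorded above reports an Intel Core i5-5257U CPU; on Azure the device is expected to be a server CPU such as the Intel Xeon E5-2673, a pretty good compute environment. Let us check the number of available compute units.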

In [16]:
dev = ctx.devices[0]
print(dev.max_compute_units)

4


There are four compute units available in this run.

We want to send some commands to the device. For this we need a command queue.

In [17]:
queue = cl.CommandQueue(ctx)

We want to send two NumPy arrays to the device. We first define the arrays on the host.

In [18]:
import numpy as np

a_np = np.random.rand(50000).astype(np.float32)
b_np = np.random.rand(50000).astype(np.float32)

We note that both arrays use single-precision floating-point numbers. GPUs are usually much faster in single precision than in double precision. The difference is much smaller on CPUs.
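The price of single precision is reduced accuracy. As a small illustration, the sketch below converts a few double-precision values to `float32` and measures the relative rounding error. The sample values are arbitrary and chosen only for this example.

```python
import numpy as np

# Arbitrary sample values, stored first in double precision.
values = np.array([0.1, 1.0 / 3.0, 123456.789], dtype=np.float64)
as_single = values.astype(np.float32)

# Bytes per element: float32 uses half the storage of float64.
print(as_single.itemsize, values.itemsize)

# Machine epsilon: the gap between 1.0 and the next representable number.
print(np.finfo(np.float32).eps, np.finfo(np.float64).eps)

# Relative error introduced by rounding each value to single precision.
rel_error = np.abs(as_single.astype(np.float64) - values) / np.abs(values)
print(rel_error.max())
```

The largest relative error stays below the `float32` machine epsilon, which corresponds to roughly seven significant decimal digits.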

We now create two memory buffers on the device that will hold the content of the NumPy arrays.

In [19]:
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

Note that OpenCL does not guarantee that the copy has already been performed when this step completes. However, it will have been done by the time a kernel needs the corresponding buffers. Also note that PyOpenCL automatically sizes the buffers so that they can hold the data contained in the NumPy arrays.

The kernel is defined in the following string.

In [20]:
prg = cl.Program(ctx, """
__kernel void sum(
    __global const float *a_g, __global const float *b_g, __global float *res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid];
}
""").build()

RuntimeError: clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE

Build on <pyopencl.Device 'Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz' on 'Apple' at 0xffffffff>:


(options: -I /usr/local/bin/miniconda3/envs/myenv/lib/python3.7/site-packages/pyopencl/cl)
(source saved as /var/folders/yy/h6q54vk950b5dn390by88t_r0000gn/T/tmpliptei45.cl)
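To see what the kernel computes, it can help to emulate it on the host. The sketch below is plain NumPy, not OpenCL: the function `sum_work_item` is a hypothetical stand-in for the kernel body, and the loop variable plays the role of `get_global_id(0)`. The small arrays are example data.

```python
import numpy as np


def sum_work_item(gid, a, b, res):
    # Mirrors the kernel body: each work-item handles exactly one index.
    res[gid] = a[gid] + b[gid]


# Small example inputs in single precision, like a_np and b_np.
a = np.arange(8, dtype=np.float32)
b = np.full(8, 0.5, dtype=np.float32)
res = np.empty_like(a)

# One work-item per array element. On a device the work-items may run in
# parallel and in any order; here they simply run one after another.
global_size = a.shape[0]
for gid in range(global_size):
    sum_work_item(gid, a, b, res)

print(res)
```

Because each work-item writes only to its own index, the order in which the work-items run does not affect the result.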

We still need a result buffer on the device and a corresponding NumPy array on the host.

In [21]:
res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)
res_np = np.empty_like(a_np)
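Unlike the input buffers, the result buffer has no host array to take its size from, so the size is passed explicitly in bytes. The following sketch, which uses only NumPy, shows where that number comes from for the arrays in this example.

```python
import numpy as np

a_np = np.random.rand(50000).astype(np.float32)

# nbytes is the number of elements times the size of one element in bytes.
print(a_np.nbytes, a_np.size * a_np.itemsize)

# empty_like allocates an uninitialized array with the same shape and dtype,
# so it has exactly the size needed to receive the result.
res_np = np.empty_like(a_np)
print(res_np.shape, res_np.dtype, res_np.nbytes)
```

For 50000 single-precision values this is 50000 × 4 = 200000 bytes.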

We will now run the kernel in the queue and copy the result back. Here, we use the queue as a context manager, which automatically calls *finish* at the end to make sure that all commands have been completed.

In [22]:
with queue:
    prg.sum(queue, a_np.shape, None, a_g, b_g, res_g)
    cl.enqueue_copy(queue, res_np, res_g)

LogicError: when processing argument #1 (1-based): clSetKernelArg failed: INVALID_MEM_OBJECT

The *enqueue_copy* call above copies the result from the device buffer *res_g* into the host array *res_np* so that we can read it.
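Once the result is on the host, it is good practice to compare it with a reference computed by NumPy. A minimal sketch of such a check follows. The helper `matches_reference` and its tolerance are assumptions made for this example, and the device result is replaced by a NumPy stand-in so that the block runs without an OpenCL driver.

```python
import numpy as np


def matches_reference(res, a, b, rtol=1e-6):
    """Return True if res agrees with a + b up to a relative tolerance.

    Hypothetical helper for this example; rtol is an assumed tolerance.
    """
    return bool(np.allclose(res, a + b, rtol=rtol, atol=0.0))


rng = np.random.default_rng(0)
a = rng.random(1000).astype(np.float32)
b = rng.random(1000).astype(np.float32)

# Stand-in for the array copied back from the device (res_np in the text).
res = a + b
print(matches_reference(res, a, b))

# A single wrong entry is enough to make the check fail.
corrupted = res.copy()
corrupted[0] += 1.0
print(matches_reference(corrupted, a, b))
```

In the notebook the same check would be applied as `matches_reference(res_np, a_np, b_np)` after the kernel has run successfully.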

That's it. We have run our first OpenCL example. Although this was run on a CPU, the same code should work on any OpenCL-compliant device, including GPUs from various vendors, FPGAs, and mobile processors.