Testing OpenCL (.cl) kernels in a Jupyter notebook by loading them into Python with the PyOpenCL module

1. Load .cl File: Read the OpenCL kernel from a .cl file into a Python string using file I/O operations.

2. Initialize OpenCL: Use pyopencl to select an OpenCL platform and device, creating a context and command queue.

3. Compile Kernel: Compile the kernel code within the notebook environment, addressing any compilation issues.

4. Prepare Data: Allocate memory for input and output data on the device using OpenCL buffers and transfer input data to the device.

5. Execute Kernel: Enqueue the kernel for execution, specifying the number of work items and groups, and wait for completion.

6. Read Results: Retrieve output data from the device to the host and analyze for correctness.

7. Cleanup: Release OpenCL resources such as buffers and the context.

Install the required pyopencl package

In [1]:
!pip install pyopencl

Collecting pyopencl
  Obtaining dependency information for pyopencl from https://files.pythonhosted.org/packages/3d/7c/d2a89b1c24c318375856e8b7611bc03ddf687134f68ddbb387496453eda8/pyopencl-2025.1-cp311-cp311-win_amd64.whl.metadata
  Downloading pyopencl-2025.1-cp311-cp311-win_amd64.whl.metadata (4.8 kB)
Collecting pytools>=2024.1.5 (from pyopencl)
  Obtaining dependency information for pytools>=2024.1.5 from https://files.pythonhosted.org/packages/5d/44/a5c139fc030c21c02ba77546ef6109a63fd448ec51ddc19b06c2a249ecec/pytools-2025.1.2-py3-none-any.whl.metadata
  Downloading pytools-2025.1.2-py3-none-any.whl.metadata (3.0 kB)
Downloading pyopencl-2025.1-cp311-cp311-win_amd64.whl (457 kB)
Downloading pytools-2025.1.2-py3-none-any.whl (92 kB)

Check the environment by listing the available OpenCL platforms with clinfo

Example output:

Number of platforms                      2
Platform Name                            Intel(R) OpenCL HD Graphics
Platform Vendor                          Intel(R) Corporation
Platform Version                         OpenCL 2.1

In [2]:
!clinfo

'clinfo' is not recognized as an internal or external command,
operable program or batch file.
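
When `clinfo` is not installed, as in the output above, similar information can be queried from Python. The following is a minimal sketch, assuming pyopencl is installed; the helper name `list_opencl_platforms` is hypothetical and not part of pyopencl.

```python
def list_opencl_platforms():
    """Return one summary line per OpenCL platform and device visible to pyopencl."""
    try:
        import pyopencl as cl
    except ImportError:
        return ["pyopencl is not installed"]

    try:
        platforms = cl.get_platforms()
    except cl.Error as exc:
        # Raised, for example, when no OpenCL driver (ICD) is installed.
        return [f"no OpenCL platform found: {exc}"]

    lines = []
    for p_idx, platform in enumerate(platforms):
        lines.append(f"platform {p_idx}: {platform.name} ({platform.version})")
        for d_idx, device in enumerate(platform.get_devices()):
            lines.append(f"  device {d_idx}: {device.name}")
    return lines


for line in list_opencl_platforms():
    print(line)
```

The printed list depends on the drivers installed on the machine, so no particular output should be expected.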


Import python modules

In [3]:
import pyopencl as cl
import numpy as np

  warn("Unable to import recommended hash 'siphash24.siphash13', "


Initialize OpenCL: Use pyopencl to select an OpenCL platform and device

Example output:

[<pyopencl.Device 'Intel(R) Gen9 HD Graphics NEO' on 'Intel(R) OpenCL HD Graphics'

In [4]:
platforms = cl.get_platforms()
# Note: despite its name, cpu_devices holds the GPU devices of the first platform
cpu_devices = platforms[0].get_devices(device_type=cl.device_type.GPU)
cpu_devices

[<pyopencl.Device 'NVIDIA GeForce RTX 3060 Laptop GPU' on 'NVIDIA CUDA' at 0x1fcff199050>]
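
The cell above only looks at `platforms[0]`, and the first platform differs between machines (Intel in the example output, NVIDIA in the actual output). A minimal sketch that searches all platforms is shown below. The helper `find_devices` is hypothetical, and it assumes that `get_devices` returns an empty list when a platform has no device of the requested type.

```python
def find_devices(platforms, device_type):
    """Return the devices of `device_type` from the first platform that has any."""
    for platform in platforms:
        devices = platform.get_devices(device_type=device_type)
        if devices:
            return devices
    return []  # no platform offers a matching device


# Usage in the notebook (requires pyopencl and an OpenCL driver):
#   devices = find_devices(cl.get_platforms(), cl.device_type.GPU)
#   context = cl.Context(devices=devices)
```

An empty result should be handled before creating the context, for example by falling back to `cl.device_type.CPU`.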

Create a context and command queue

In [5]:
context = cl.Context(devices=cpu_devices)

# Create a command queue for the target device
queue = cl.CommandQueue(context)

Load .cl File: Read the OpenCL kernel from a .cl file into a Python string using file I/O operations

In [8]:
file_name = "./device/matrix_mul.cl"  # Replace with the name of your uploaded .cl file
with open(file_name, 'r') as file:
    kernel_code = file.read()
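
The contents of `matrix_mul.cl` are not shown in this notebook. For readers who want to follow along, the sketch below writes a HYPOTHETICAL stand-in kernel to a temporary file and reads it back the same way. The kernel body is an assumption inferred from the argument order used later in the `program.matrixMul(...)` call and from the NumPy reference check; it is not the original file.

```python
import os
import tempfile

# HYPOTHETICAL kernel: one work item computes one output neuron as the dot
# product of the input tile with one row of the row-major weight matrix.
KERNEL_SRC = r"""
__kernel void matrixMul(__global const float* input_tile,
                        __global const float* weights_tile,
                        const int input_tile_size,
                        const int output_neurons_tile_size,
                        __global float* output_tile)
{
    int o = get_global_id(0);
    if (o >= output_neurons_tile_size) return;

    float acc = 0.0f;
    for (int i = 0; i < input_tile_size; ++i) {
        acc += weights_tile[o * input_tile_size + i] * input_tile[i];
    }
    output_tile[o] = acc;
}
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "matrix_mul.cl")
    # Write the stand-in kernel, then load it back as plain text.
    with open(path, "w") as file:
        file.write(KERNEL_SRC)
    with open(path, "r") as file:
        kernel_code = file.read()

print(kernel_code)
```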

Compile Kernel: Compile the kernel code

In [9]:
program = cl.Program(context, kernel_code).build()

  lambda: self._prg.build(options_bytes, devices),


Prepare test data: fill the input tile and weights with random values and initialize the output tile to zeros

In [10]:
input_tile_size = 16
output_neurons_tile_size = 10

# Initialize random data for the input tile and weights
input_tile = np.random.rand(input_tile_size).astype(np.float32)
weights_tile = np.random.rand(input_tile_size * output_neurons_tile_size).astype(np.float32)

output_tile = np.zeros(output_neurons_tile_size, dtype=np.float32)

Prepare Data, Execute Kernel, Read Results: Create device buffers, run the kernel, and copy the output back to the host

In [12]:
# Create memory buffers
input_tile_buf = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=input_tile)
weights_tile_buf = cl.Buffer(context, cl.mem_flags.READ_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=weights_tile)
output_tile_buf = cl.Buffer(context, cl.mem_flags.WRITE_ONLY, output_tile.nbytes)

# Rebuild the kernel (optional: program was already built in the compile cell above)
program = cl.Program(context, kernel_code).build()

# Execute the kernel with one work item per output neuron;
# local_size=None lets the OpenCL runtime choose the work-group size
global_size = (output_tile.size,)
local_size = None
program.matrixMul(queue, global_size, local_size,
                  input_tile_buf, weights_tile_buf,
                  np.int32(input_tile_size), np.int32(output_neurons_tile_size),
                  output_tile_buf)

# Read the output buffer back to the host
cl.enqueue_copy(queue, output_tile, output_tile_buf)

# Output the results
print(output_tile)

[4.8587184 3.2308404 3.474527  3.8252118 3.2486756 4.1346297 4.413717
 4.404695  3.5091274 3.5865757]
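
Once the results are on the host, the device buffers are no longer needed (step 7, Cleanup). A minimal sketch is given below, assuming each buffer exposes a `release()` method as pyopencl memory objects do; the helper `release_buffers` is hypothetical. To my understanding, the context and command queue are freed when their Python objects are garbage-collected, so deleting the references is enough for them.

```python
def release_buffers(*buffers):
    """Call release() on each buffer-like object and return how many were released."""
    released = 0
    for buf in buffers:
        buf.release()  # frees the device memory held by this buffer
        released += 1
    return released


# Usage in the notebook:
#   release_buffers(input_tile_buf, weights_tile_buf, output_tile_buf)
#   del queue, context
```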


In [13]:
def matrix_vector_multiply(input_tile, weights_tile, input_tile_size, output_neurons_tile_size):
    """Host-side NumPy reference used to check the kernel output."""
    # Reshape the flat weights into a (outputs, inputs) matrix, one row per output neuron
    weights_matrix = weights_tile.reshape((output_neurons_tile_size, input_tile_size))

    # Perform matrix-vector multiplication
    output_tile = np.dot(weights_matrix, input_tile)

    return output_tile
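
The reshape above implies a row-major layout in which the weight for output `o` and input `i` sits at flat index `o * input_tile_size + i`. The sketch below checks that reading with an explicit loop on freshly generated random data. It assumes the kernel uses the same layout, which the matching outputs in this notebook suggest but the kernel source would confirm.

```python
import numpy as np

rng = np.random.default_rng(0)
input_tile_size, output_neurons_tile_size = 16, 10
input_tile = rng.random(input_tile_size, dtype=np.float32)
weights_tile = rng.random(input_tile_size * output_neurons_tile_size, dtype=np.float32)

# Explicit loop over the flat, row-major weight array
loop_output = np.zeros(output_neurons_tile_size, dtype=np.float32)
for o in range(output_neurons_tile_size):
    for i in range(input_tile_size):
        loop_output[o] += weights_tile[o * input_tile_size + i] * input_tile[i]

# Same computation via reshape and a matrix-vector product
vectorized_output = weights_tile.reshape(output_neurons_tile_size, input_tile_size) @ input_tile

assert np.allclose(loop_output, vectorized_output, rtol=1e-5)
```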

In [14]:
test_output = matrix_vector_multiply(input_tile, weights_tile, input_tile_size, output_neurons_tile_size)

In [16]:
test_output

array([4.858718 , 3.2308407, 3.474527 , 3.825212 , 3.2486756, 4.13463  ,
       4.4137173, 4.4046955, 3.5091276, 3.5865755], dtype=float32)
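
The two results agree up to float32 rounding, but comparing them by eye does not scale. A minimal sketch of an automated check follows, using the values printed by the two cells above; the tolerance `rtol=1e-5` is an assumption that suits float32 sums of this size.

```python
import numpy as np

# Values copied from the kernel output and the NumPy reference output above
gpu_output = np.array([4.8587184, 3.2308404, 3.474527, 3.8252118, 3.2486756,
                       4.1346297, 4.413717, 4.404695, 3.5091274, 3.5865757],
                      dtype=np.float32)
cpu_output = np.array([4.858718, 3.2308407, 3.474527, 3.825212, 3.2486756,
                       4.13463, 4.4137173, 4.4046955, 3.5091276, 3.5865755],
                      dtype=np.float32)

max_abs_diff = float(np.max(np.abs(gpu_output - cpu_output)))

# Raises AssertionError if any element differs by more than the tolerance
np.testing.assert_allclose(gpu_output, cpu_output, rtol=1e-5)
print(f"max abs difference: {max_abs_diff:.2e}")
```

In the notebook itself, the same check can be written as `np.testing.assert_allclose(output_tile, test_output, rtol=1e-5)`.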