Kernel launch overhead #830

@99991

Describe the bug

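Launching a kernel via PyOpenCL is roughly 20 times slower than launching the same kernel by calling the corresponding OpenCL functions directly via ctypes (about 29 times slower on the CPU and about 13 times slower on the GPU in the measurements below).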

To Reproduce

I discovered this problem while trying to implement bitonic sort. The demonstration first invokes the kernel with PyOpenCL's call syntax and then calls the same kernel directly through the OpenCL functions loaded from libOpenCL.so using ctypes.
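For context on why launch overhead matters for bitonic sort, here is a minimal pure-Python sketch of the sorting network. It is not the benchmark from this report. The function name `bitonic_sort_passes` is hypothetical, and it assumes a typical GPU implementation that issues one kernel launch per compare-exchange pass, which the report does not confirm.

```python
def bitonic_sort_passes(values):
    """Sort a power-of-two-length list with a bitonic network.

    Returns (sorted_list, number_of_passes). Hypothetical helper for
    illustration only; each pass stands in for one kernel launch.
    """
    data = list(values)
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"

    passes = 0
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # compare-exchange distance within this merge
            # Assumption: on a device, this loop over i would be one
            # kernel launch with one work item per index i.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            passes += 1
            j //= 2
        k *= 2
    return data, passes


if __name__ == "__main__":
    result, passes = bitonic_sort_passes([5, 1, 7, 3, 2, 8, 6, 4])
    print(result, passes)
```

For an input of length 2^m the network has m(m+1)/2 passes, so sorting 2^20 elements would need 210 launches under the one-launch-per-pass assumption. Any fixed per-launch cost is multiplied accordingly. The timings that follow are from the original benchmark, not from this sketch.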

Output with Intel Core i5-2400 CPU:

231.444 ms with PyOpenCL dispatch method
221.348 ms with PyOpenCL dispatch method
221.905 ms with PyOpenCL dispatch method
220.082 ms with PyOpenCL dispatch method
219.206 ms with PyOpenCL dispatch method
223.196 ms with PyOpenCL dispatch method
246.064 ms with PyOpenCL dispatch method
252.709 ms with PyOpenCL dispatch method
255.384 ms with PyOpenCL dispatch method
249.199 ms with PyOpenCL dispatch method
  8.495 ms with ctypes dispatch method
  8.350 ms with ctypes dispatch method
  8.540 ms with ctypes dispatch method
  7.898 ms with ctypes dispatch method
  7.979 ms with ctypes dispatch method
  7.343 ms with ctypes dispatch method
  7.300 ms with ctypes dispatch method
  9.271 ms with ctypes dispatch method
  7.459 ms with ctypes dispatch method
  7.111 ms with ctypes dispatch method

Output with NVIDIA V100 GPU:

 25.175 ms with PyOpenCL dispatch method
 24.560 ms with PyOpenCL dispatch method
 24.169 ms with PyOpenCL dispatch method
 23.987 ms with PyOpenCL dispatch method
 24.396 ms with PyOpenCL dispatch method
 23.612 ms with PyOpenCL dispatch method
 23.937 ms with PyOpenCL dispatch method
 24.184 ms with PyOpenCL dispatch method
 23.541 ms with PyOpenCL dispatch method
 23.887 ms with PyOpenCL dispatch method
  1.951 ms with ctypes dispatch method
  1.872 ms with ctypes dispatch method
  1.876 ms with ctypes dispatch method
  1.885 ms with ctypes dispatch method
  1.879 ms with ctypes dispatch method
  1.872 ms with ctypes dispatch method
  1.914 ms with ctypes dispatch method
  1.962 ms with ctypes dispatch method
  1.869 ms with ctypes dispatch method
  1.863 ms with ctypes dispatch method
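The slowdown factors can be checked from the numbers above. The following short script is only a sketch: it uses the reported timings verbatim, and the variable and function names are illustrative.

```python
from statistics import mean

# Timings in milliseconds, copied from the two outputs above.
cpu_pyopencl = [231.444, 221.348, 221.905, 220.082, 219.206,
                223.196, 246.064, 252.709, 255.384, 249.199]
cpu_ctypes = [8.495, 8.350, 8.540, 7.898, 7.979,
              7.343, 7.300, 9.271, 7.459, 7.111]
gpu_pyopencl = [25.175, 24.560, 24.169, 23.987, 24.396,
                23.612, 23.937, 24.184, 23.541, 23.887]
gpu_ctypes = [1.951, 1.872, 1.876, 1.885, 1.879,
              1.872, 1.914, 1.962, 1.869, 1.863]


def slowdown(slow, fast):
    """Ratio of mean run times: how many times slower `slow` is."""
    return mean(slow) / mean(fast)


cpu_slowdown = slowdown(cpu_pyopencl, cpu_ctypes)
gpu_slowdown = slowdown(gpu_pyopencl, gpu_ctypes)

print(f"Intel Core i5-2400 (pocl): {cpu_slowdown:.1f}x slower via PyOpenCL")
print(f"NVIDIA V100:               {gpu_slowdown:.1f}x slower via PyOpenCL")
```

The mean ratios come out at roughly 29x on the CPU and roughly 13x on the GPU.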

Expected behavior

PyOpenCL's kernel launch overhead should be much closer to that of the direct ctypes calls. 🚀

Environment

Computer 1:

  • OS: Debian 12
  • ICD Loader and version: 2.3.1-1
  • ICD and version: pocl 1.5
  • CPU/GPU: Intel Core i5-2400 CPU
  • Python version: 3.10.12
  • PyOpenCL version: 2024.2.7

Computer 2:

  • OS: Debian 12
  • ICD Loader and version: 2.3.1-1
  • ICD and version: NVIDIA driver 535.247.01
  • CPU/GPU: NVIDIA V100 GPU
  • Python version: 3.12.3
  • PyOpenCL version: 2025.1

Additional context

A similar issue was created in 2016, but the answer links to a mailing list that no longer exists. Unfortunately, the mailing list was not archived on https://web.archive.org.
