Skip to content

Latest commit

 

History

History
243 lines (179 loc) · 11.4 KB

notes_4_0.rst

File metadata and controls

243 lines (179 loc) · 11.4 KB

Release Notes for PoCL 4.0

Major new features

Support for Clang/LLVM 16.0

PoCL now supports Clang/LLVM from 10.0 to 16.0 inclusive. The most PoCL-relevant change of the new 16.0 release is support for _Float16 type on x86 and ARM targets.

CPU driver

Support for program-scope variables

Global variables in program-scope are now supported, along with static global variables in function-scope, for both OpenCL C source and SPIR-V compilation. The implementation passes the basic/test_basic test of the OpenCL-CTS, and has been tested with client applications through chipStar.

global float testGlobalVar[128];

__kernel void test1 (__global const float *a) {
  size_t i = get_global_id(0) % 128;
  testGlobalVar[i] += a[i];
}

__kernel void test2 (__global const float *a) {
  size_t i = get_global_id(0) % 128;
  testGlobalVar[i] *= a[i];
}

__kernel void test3 (__global float *out) {
  size_t i = get_global_id(0) % 128;
  out[i] = testGlobalVar[i];
}

Support for generic address space

Generic AS is now supported, for both OpenCL C source and SPIR-V compilation. PoCL now passes the generic_address_space/test_generic_address_space test of the OpenCL-CTS, and has been tested with CUDA/HIP applications through chipStar.

int isOdd(int *val) {
  return val[0] % 2;
}

__kernel void test3 (__global int *in1, __local int *in2, __global int *out) {
  size_t i = get_global_id(0);
  out[i] = isOdd(in1+i) + isOdd(in2+(i % 128)];
}

Initial support for cl_khr_subgroups

The default is a single subgroup that always executes the whole X-dimension's WIs. Independent forward progress is not yet supported, but it's not needed for CTS compliance, due to the corner case of only one SG in flight.

Additionally, there is partial implementation for cl_khr_subgroup_shuffle, cl_intel_subgroups and cl_khr_subgroup_ballot with caveats:

  • cl_khr_subgroup_shuffle: Passes the CTS, but only because it doesn't test non-uniform(lock-step) behavior, see: KhronosGroup/OpenCL-CTS#1236
  • cl_khr_subgroup_ballot: sub_group_ballot() works for uniform calls, the rest are unimplemented.
  • cl_intel_subgroups: The block reads/writes are unimplemented.

Initial support for cl_intel_required_subgroup_size

This extension allows the programmer to specify the required subgroup size for a kernel function. This can be important for algorithm correctness in some cases. It's used by chipStar to implement fixed width warps when needed. The programmer can specify the size with a new kernel attribute: __attribute__((intel_reqd_sub_group_size(<int>)))

PoCL additionally implements CL_DEVICE_SUB_GROUP_SIZES_INTEL parameter for clGetDeviceInfo API, however CL_​KERNEL_​SPILL_​MEM_​SIZE_​INTEL and CL_​KERNEL_​COMPILE_​SUB_​GROUP_​SIZE_​INTEL for clGetKernelWorkGroupInfo API are not yet implemented.

Initial support for cl_khr_fp16

PoCL now has partial support for cl_khr_fp16 when compiled with Clang/LLVM 16+. The implementation relies on Clang, and may result in emulation (promoting to fp32) if the CPU does not support the required instruction set. In Clang/LLVM 16+, the following targets have native fp16 support: 32-bit and 64-bit ARM (depending on vendor), x86-64 with AVX512-FP16. Currently only implemented for a part of builtin library functions, those that are implemented with either an expression, or a Clang builtin.

Level Zero driver

This is a new experimental driver that supports devices accessible via Level Zero API.

The driver has been tested with multiple devices (iGPU and dGPU), and passes a large portion of PoCL tests (87% tests passed, 32 tests fail out of 254), however it has not been finished nor optimized yet, therefore it cannot be considered production quality.

The driver supports the following OpenCL extensions, in addition to atomics: cl_khr_il_program, cl_khr_3d_image_writes, cl_khr_fp16, cl_khr_fp64, cl_khr_subgroups, cl_intel_unified_shared_memory. In addition, Specialization Constants and SVM are supported.

We also intend to use the driver for prototyping features not found in the official Intel Compute Runtime OpenCL drivers, and for experimenting with asynchronous execution with other OpenCL devices in the same PoCL platform. One such feature currently implemented is the JIT kernel compilation, which is useful with programs that have thousands of kernels but only launch a few of them (e.g. when using SPIR-V IL produced from heavily templated C++ code). For details, see the full driver documentation in doc/sphinx/source/level0.rst.

Support for cl_intel_unified_shared_memory

This extension, together with SPIR-V support and other new features, allows using PoCL as an OpenCL backend for SYCL runtimes. This works with the both CPU driver (tested on x86-64 & ARM64) and the Level Zero driver. Vincent A. Arcila has contributed a guide for building PoCL as SYCL runtime backend on ARM.

Additionally, there is a new testsuite integrated into PoCL for testing USM support, intel-compute-samples. These are tests from https://github.com/intel/compute-samples and PoCL currently passes 78% of the tests (12 tests failed out of 54).

New testsuites

There are also multiple new CTest testsuites in PoCL. For testing PoCL as a SYCL backend, there are three new testsuites: dpcpp-book-samples, oneapi-samples and simple-sycl-samples.

For testing PoCL as chipStar's OpenCL backend: chipStar testsuite. This builds the runtime and the tests from https://github.com/CHIP-SPV/chipStar, and runs a subset of tests (approximately 800) with PoCL as the chipStar's backend.

Mac OS X support

Thanks to efforts of Isuru Fernando who stepped up to become the official Mac OSX port maintainer, PoCL's CPU driver has been again fixed to work on Mac OS X. The current 4.0 release has been tested on these configurations:

MacOS 10.13 (Intel Sandybridge), MacOS 11.7 Intel (Ivybridge) with Clang 15.

Additionally, there are now Github Actions for CI testing of PoCL with Mac OS X, testing 4 different configurations: LLVM 15 and 16, with and without ICD loader.

Github Actions

The original CI used by PoCL authors (Python Buildbot, https://buildbot.net) has been converted to publicly accessible Github Actions CI. These are currently set up to test PoCL with last two LLVM versions rigorously, and basic tests with older LLVM versions. The most tested driver is the CPU driver, with multiple configurations enabling or testing different features: sanitizers, external testsuites, SYCL support, OpenCL conformance, SPIR-V support. There are also basic tests for other experimental/WiP/research-drivers in PoCL: OpenASIP, Vulkan, CUDA, and LevelZero.

Bugfixes and minor features

  • CMake: it's now possible to disable libhwloc support even when it's present, using -DENABLE_HWLOC=0 CMake option
  • AlmaIF's OpenASIP backend now supports a standalone mode. It generates a standalone C program from a kernel launch, which can then be compiled and executed with ttasim or RTL simulation.
  • Added a user env POCL_BITCODE_FINALIZER that can be used to call a custom script that manipulates the final bitcode before passing it to the code generation.
  • New alternative work-group function mode for non-SPMD from Open SYCL: Continuation-based synchronization is somewhat more general than the default one in PoCL's current kernel compiler, but allows for fewer hand-rolled optimizations. CBS is expected to work for kernels that PoCL's current kernel compiler does not support. Currently, CBS can be manually enabled by setting the environment variable POCL_WORK_GROUP_METHOD=cbs.
  • Linux/x86-64 only: SIGFPE handler has been changed to skip instructions causing division-by-zero, only if it occured in one of the CPU driver threads; so division-by-zero errors are no longer hidden in user threads.
  • CUDA driver: POCL_CUDA_VERIFY_MODULE env variable has been replaced by POCL_LLVM_VERIFY
  • CUDA driver: compilation now defaults to -ffp-contract=fast, previously it was -ffp-contract=on.
  • CUDA driver: support for Direct Peer-to-Peer buffer migrations This allows much better performance scaling in multi-GPU scenarios
  • OpenCL C: -cl-fast-relaxed-math now defaults to -ffp-contract=fast, previously it was -ffp-contract=on.
  • CPU drivers: renamed 'basic' to 'cpu-minimal' and 'pthread' driver to 'cpu', to reflect the hardware they're driving instead of implementation details.
  • CPU drivers: POCL_MAX_PTHREAD_COUNT renamed to POCL_CPU_MAX_CU_COUNT; the old env. variable is deprecated but still works
  • CPU drivers: Added a new POCL_CPU_LOCAL_MEM_SIZE environment for overriding the local memory size.
  • CPU drivers: OpenCL C printf() flushes output after each call instead of waiting for the end of the kernel command. This makes it more useful for debugging kernel segfaults.