Various features, fixes and hacks to make oneAPI-samples/Libraries/MPI/jacobian_solver SYCL run on PoCL-CPU and PoCL-R #1438

Merged
merged 8 commits into from Mar 15, 2024
3 changes: 2 additions & 1 deletion CMakeLists.txt
@@ -1325,7 +1325,8 @@ cl_exp_pinned_buffers")
set(HOST_DEVICE_FEATURES_30 "__opencl_c_3d_image_writes __opencl_c_images \
__opencl_c_atomic_order_acq_rel __opencl_c_atomic_order_seq_cst \
__opencl_c_atomic_scope_device __opencl_c_program_scope_global_variables \
__opencl_c_atomic_scope_all_devices __opencl_c_generic_address_space")
__opencl_c_atomic_scope_all_devices __opencl_c_generic_address_space \
__opencl_c_work_group_collective_functions")

# Host CPU device: extensions only enabled when conformance is OFF
if(NOT ENABLE_CONFORMANCE)
28 changes: 17 additions & 11 deletions doc/sphinx/source/notes_6_0.rst
@@ -2,14 +2,6 @@
Release Notes for PoCL 6.0
**************************



Minimal support for `cl_khr_priority_hints` and `cl_khr_throttle_hints` has been added.
As the extension specification states that these hints provide no guarantees of
any particular behavior (or lack thereof) they are treated as a no-op. However
specifying them no longer causes `clCreateCommandQueueWithProperties` to return
an error.

============================
New device driver: cpu-tbb
============================
@@ -18,6 +10,16 @@ The cpu-tbb device driver uses the Intel oneAPI Threading Building Blocks (oneTB
library for work-group and kernel-level task scheduling. Except for the
task scheduler, the driver is identical to the original 'cpu' driver (pthread).

=====================================
Command queue priority/throttle hints
=====================================

Minimal support for `cl_khr_priority_hints` and `cl_khr_throttle_hints` has been added.
As the extension specifications state that these hints provide no guarantees of
any particular behavior (or lack thereof), they are treated as no-ops. However,
specifying them no longer causes `clCreateCommandQueueWithProperties` to return
an error.
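As an illustration of what treating the hints as no-ops means for property parsing, here is a minimal sketch of a property-list validator that accepts the priority and throttle keys and otherwise ignores them. It is not PoCL's `clCreateCommandQueueWithProperties` code: the `PROP_*` keys are hypothetical local stand-ins for the real macros from the Khronos headers (such as `CL_QUEUE_PRIORITY_KHR` and `CL_QUEUE_THROTTLE_KHR`), so the sketch compiles without OpenCL installed.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real OpenCL queue property keys. */
enum
{
  PROP_QUEUE_PROPERTIES = 1,
  PROP_QUEUE_SIZE = 2,
  PROP_PRIORITY_HINT = 3, /* plays the role of CL_QUEUE_PRIORITY_KHR */
  PROP_THROTTLE_HINT = 4  /* plays the role of CL_QUEUE_THROTTLE_KHR */
};

/* Walks a zero-terminated {key, value, key, value, ..., 0} list.
   Returns 0 if every key is recognized and -1 on the first unknown key.
   The priority and throttle hints are accepted and then ignored. */
int
validate_queue_properties (const long *props, long *flags_out)
{
  *flags_out = 0;
  if (props == NULL)
    return 0;
  for (size_t i = 0; props[i] != 0; i += 2)
    {
      switch (props[i])
        {
        case PROP_QUEUE_PROPERTIES:
          *flags_out = props[i + 1]; /* remember the queue flags */
          break;
        case PROP_QUEUE_SIZE:
          break; /* a real implementation would validate the size */
        case PROP_PRIORITY_HINT:
        case PROP_THROTTLE_HINT:
          break; /* no-op: the extensions guarantee no behavior */
        default:
          return -1; /* unknown key: report an error */
        }
    }
  return 0;
}
```

In this sketch, removing the two hint cases would send those keys to the `default` branch, which corresponds to the old behavior of returning an error.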

===========================
Driver-specific features
===========================
@@ -26,9 +28,13 @@ Driver-specific features
CPU driver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'cpu' driver gained support for using OpenMP for thread scheduling.
Support is disabled by default, but can be enabled with CMake option. The
'cpu-minimal' driver does not support OpenMP.
* Support for using OpenMP for task scheduling was added. It is disabled
by default, but can be enabled with a CMake option. The 'cpu-minimal'
driver does not support OpenMP, since it is meant to be single-threaded.
* The CPU drivers can now be used to run SYCL programs compiled with
the binary oneAPI distributions of DPC++ by setting the following environment
variables: **POCL_DRIVER_VERSION_OVERRIDE=2023.16.7.0.21_160000 POCL_CPU_VENDOR_ID_OVERRIDE=32902**.
* Added support for the **__opencl_c_work_group_collective_functions** feature.
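For readers unfamiliar with the feature: OpenCL C work-group collective functions such as `work_group_reduce_add` and `work_group_scan_inclusive_add` combine one value from every work-item of a work-group. The sketch below emulates their results sequentially on the host for a single work-group, which can help when checking expected outputs. It is not PoCL's implementation (the diff adds `work_group.c` to the host kernel library, but its contents are not shown), and the `emulate_*` names are hypothetical.

```c
#include <stddef.h>

/* Emulates work_group_reduce_add: every work-item receives the sum of
   all inputs in the work-group. out[i] is what work-item i would see. */
void
emulate_wg_reduce_add (const int *in, int *out, size_t wg_size)
{
  int sum = 0;
  for (size_t i = 0; i < wg_size; ++i)
    sum += in[i];
  for (size_t i = 0; i < wg_size; ++i)
    out[i] = sum;
}

/* Emulates work_group_scan_inclusive_add: work-item i receives the sum
   of inputs 0..i, in linear local ID order. */
void
emulate_wg_scan_inclusive_add (const int *in, int *out, size_t wg_size)
{
  int running = 0;
  for (size_t i = 0; i < wg_size; ++i)
    {
      running += in[i];
      out[i] = running;
    }
}

/* Emulates work_group_scan_exclusive_add: work-item i receives the sum
   of inputs 0..i-1, so work-item 0 receives the identity value 0. */
void
emulate_wg_scan_exclusive_add (const int *in, int *out, size_t wg_size)
{
  int running = 0;
  for (size_t i = 0; i < wg_size; ++i)
    {
      out[i] = running;
      running += in[i];
    }
}
```

For the inputs {1, 2, 3, 4}, the reduction gives 10 to every work-item, the inclusive scan gives {1, 3, 6, 10} and the exclusive scan gives {0, 1, 3, 6}.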

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Remote
16 changes: 15 additions & 1 deletion doc/sphinx/source/using.rst
@@ -160,6 +160,14 @@ pocl.
'cpu' device driver. The default is to determine this from the number of
hardware threads available in the CPU.

- **POCL_CPU_VENDOR_ID_OVERRIDE**

Overrides the vendor ID reported by PoCL for the CPU drivers.
For example, setting the vendor ID to 32902 (0x8086) and setting the driver
version with **POCL_DRIVER_VERSION_OVERRIDE** to "2023.16.7.0.21_160000" (or similar)
can be used to convince binary-distributed DPC++ compilers to compile and run SYCL
programs on the PoCL-CPU driver.

- **POCL_DEBUG**

Enables debug messages to stderr. This will be mostly messages from error
@@ -170,7 +178,8 @@ pocl.

The old way (setting POCL_DEBUG to 1) has been updated to support categories.
Using this limits the amount of debug messages produced. Current options are:
error,warning,general,memory,llvm,events,cache,locking,refcounts,timing,hsa,tce,cuda,vulkan,proxy,all.
'error', 'warning', 'general', 'memory', 'llvm', 'events', 'cache', 'locking',
'refcounts', 'timing', 'hsa', 'tce', 'cuda', 'vulkan', 'proxy' and 'all'.
Note: setting POCL_DEBUG to 1 still works and equals error+warning+general.

- **POCL_DEBUG_LLVM_PASSES**
@@ -221,6 +230,11 @@ pocl.
POCL_TTASIM0_PARAMETERS will be passed to the first ttasim driver instantiated
and POCL_TTASIM1_PARAMETERS to the second one.

- **POCL_DRIVER_VERSION_OVERRIDE**

Can be used to override the driver version reported by PoCL.
See **POCL_CPU_VENDOR_ID_OVERRIDE** for an example use case.

- **POCL_EXTRA_BUILD_FLAGS**

Adds the contents of the environment variable to all clBuildProgram() calls.
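The category names accepted by **POCL_DEBUG** map naturally onto a bitmask. The following hypothetical sketch parses a comma-separated value that way. The bit assignments and the function name are assumptions rather than PoCL's implementation; only the category list and the rule that `1` means error+warning+general come from the documentation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical bit assignment: one bit per documented category, in the
   order the documentation lists them ("all" is handled separately). */
static const char *const DEBUG_CATEGORIES[]
    = { "error",  "warning", "general",   "memory", "llvm",
        "events", "cache",   "locking",   "refcounts", "timing",
        "hsa",    "tce",     "cuda",      "vulkan", "proxy" };
#define NUM_DEBUG_CATEGORIES                                                  \
  (sizeof (DEBUG_CATEGORIES) / sizeof (DEBUG_CATEGORIES[0]))

/* Parses a POCL_DEBUG-style value into a bitmask. "1" means
   error+warning+general, "all" sets every bit, unknown names are ignored. */
uint32_t
parse_debug_categories (const char *value)
{
  uint32_t mask = 0;
  if (value == NULL)
    return 0;
  if (strcmp (value, "1") == 0)
    return (1u << 0) | (1u << 1) | (1u << 2);

  const char *p = value;
  while (*p != '\0')
    {
      size_t len = strcspn (p, ","); /* length of the next token */
      if (len == 3 && strncmp (p, "all", 3) == 0)
        mask = (1u << NUM_DEBUG_CATEGORIES) - 1;
      for (size_t i = 0; i < NUM_DEBUG_CATEGORIES; ++i)
        if (strlen (DEBUG_CATEGORIES[i]) == len
            && strncmp (p, DEBUG_CATEGORIES[i], len) == 0)
          mask |= 1u << i;
      p += len;
      if (*p == ',')
        ++p; /* skip the separator */
    }
  return mask;
}
```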
4 changes: 4 additions & 0 deletions examples/boxadd/boxadd.c
@@ -112,5 +112,9 @@ main (int argc, char **argv)
CHECK_CL_ERROR (clReleaseContext (context));
CHECK_CL_ERROR (clUnloadPlatformCompiler (platform));

free (srcA);
free (srcB);
free (dst);

return err;
}
4 changes: 4 additions & 0 deletions examples/matadd/matadd.c
@@ -108,5 +108,9 @@ main (int argc, char **argv)
CHECK_CL_ERROR (clReleaseContext (context));
CHECK_CL_ERROR (clUnloadPlatformCompiler (platform));

free (srcA);
free (srcB);
free (dst);

return err;
}
8 changes: 5 additions & 3 deletions include/pocl.h
@@ -47,9 +47,11 @@
/* detects restrict, variadic macros etc */
#include "pocl_compiler_features.h"

/* The maximum file, directory and path name lengths. TODO: These should be
detected from the filesystem properties of the execution platform. */
#define POCL_MAX_DIRNAME_LENGTH 255
/* The maximum file, directory and path name lengths.
NOTE: GDB seems to fail to load symbols from .so files whose path names
are longer than 511 characters, hence the rather small dir/filename
length limit. */
#define POCL_MAX_DIRNAME_LENGTH 64
#define POCL_MAX_FILENAME_LENGTH (POCL_MAX_DIRNAME_LENGTH)
#define POCL_MAX_PATHNAME_LENGTH 4096

5 changes: 1 addition & 4 deletions lib/CL/clCreateProgramWithIL.c
@@ -128,16 +128,13 @@ CL_API_SUFFIX__VERSION_2_1
POCL_GOTO_ERROR_COND ((length == 0), CL_INVALID_VALUE);

int is_spirv = 0;
#ifdef ENABLE_SPIRV
int is_spirv_kernel
= pocl_bitcode_is_spirv_execmodel_kernel ((const char *)il, length);
is_spirv += is_spirv_kernel;
#endif
#ifdef ENABLE_VULKAN

int is_spirv_shader
= pocl_bitcode_is_spirv_execmodel_shader ((const char *)il, length);
is_spirv += is_spirv_shader;
#endif

POCL_GOTO_ERROR_ON (
(!is_spirv), CL_INVALID_VALUE,
6 changes: 3 additions & 3 deletions lib/CL/clGetDeviceInfo.c
@@ -61,14 +61,14 @@ POname(clGetDeviceInfo)(cl_device_id device,
case CL_DEVICE_IMAGE_SUPPORT:
POCL_RETURN_GETINFO(cl_bool, device->image_support);
case CL_DEVICE_TYPE:
POCL_RETURN_GETINFO(cl_device_type, device->type);
POCL_RETURN_GETINFO (cl_device_type, device->type);
case CL_DEVICE_VENDOR_ID:
POCL_RETURN_GETINFO(cl_uint, device->vendor_id);
case CL_DEVICE_MAX_COMPUTE_UNITS:
POCL_RETURN_GETINFO(cl_uint, device->max_compute_units);
case CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS :
POCL_RETURN_GETINFO(cl_uint, device->max_work_item_dimensions);
case CL_DEVICE_MAX_WORK_GROUP_SIZE :
case CL_DEVICE_MAX_WORK_GROUP_SIZE:
{
size_t max_wg_size = device->max_work_group_size;
POCL_RETURN_GETINFO(size_t, max_wg_size);
@@ -342,7 +342,7 @@ POname(clGetDeviceInfo)(cl_device_id device,
case CL_DEVICE_NON_UNIFORM_WORK_GROUP_SUPPORT:
POCL_RETURN_GETINFO (cl_bool, device->non_uniform_work_group_support);
case CL_DEVICE_WORK_GROUP_COLLECTIVE_FUNCTIONS_SUPPORT:
POCL_RETURN_GETINFO (cl_bool, CL_FALSE);
POCL_RETURN_GETINFO (cl_bool, device->wg_collective_func_support);
case CL_DEVICE_GENERIC_ADDRESS_SPACE_SUPPORT:
POCL_RETURN_GETINFO (cl_bool, device->generic_as_support);
case CL_DEVICE_DEVICE_ENQUEUE_CAPABILITIES:
8 changes: 4 additions & 4 deletions lib/CL/devices/common.c
@@ -1984,7 +1984,7 @@ pocl_setup_ils_with_version (cl_device_id dev)
}
}

static const cl_name_version OPENCL_FEATURES[] = {
static const cl_name_version OPENCL_C_FEATURES[] = {
{ CL_MAKE_VERSION (3, 0, 0), "__opencl_c_3d_image_writes" },
{ CL_MAKE_VERSION (3, 0, 0), "__opencl_c_images" },
{ CL_MAKE_VERSION (3, 0, 0), "__opencl_c_read_write_images" },
@@ -2013,15 +2013,15 @@ static const cl_name_version OPENCL_FEATURES[] = {
{ CL_MAKE_VERSION (3, 0, 0), "__opencl_c_ext_fp64_local_atomic_min_max" },
};

const size_t OPENCL_FEATURES_NUM
= sizeof (OPENCL_FEATURES) / sizeof (OPENCL_FEATURES[0]);
const size_t OPENCL_C_FEATURES_NUM
= sizeof (OPENCL_C_FEATURES) / sizeof (OPENCL_C_FEATURES[0]);

void
pocl_setup_features_with_version (cl_device_id dev)
{
cl_name_version *tmp = NULL;
unsigned ret = pocl_space_delim_string_to_cl_name_version_array (
&tmp, dev->features, OPENCL_FEATURES, OPENCL_FEATURES_NUM);
&tmp, dev->features, OPENCL_C_FEATURES, OPENCL_C_FEATURES_NUM);

dev->num_opencl_features_with_version = ret;
dev->opencl_features_with_version = tmp;
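`pocl_setup_features_with_version` turns the device's space-delimited `features` string (for the CPU drivers, `HOST_DEVICE_FEATURES_30` from CMakeLists.txt) into an array of name/version pairs by looking the names up in `OPENCL_C_FEATURES`. The lookup helper's body is not part of this diff, so the following is only a sketch of how such a lookup could work. The `name_version` struct is a simplified, hypothetical stand-in for `cl_name_version`, the function name is hypothetical, and skipping unknown names is a choice made for this sketch.

```c
#include <string.h>

/* Simplified, hypothetical stand-in for cl_name_version. */
typedef struct
{
  unsigned major;
  unsigned minor;
  const char *name;
} name_version;

/* Splits a space-delimited feature string and copies every table entry
   whose name matches a token into out (at most out_cap entries).
   Unknown tokens are skipped. Returns the number of entries written. */
size_t
features_to_name_versions (const char *features, const name_version *table,
                           size_t table_len, name_version *out,
                           size_t out_cap)
{
  size_t count = 0;
  const char *p = features;
  while (*p != '\0' && count < out_cap)
    {
      p += strspn (p, " ");          /* skip separators */
      size_t len = strcspn (p, " "); /* length of the next token */
      if (len == 0)
        break; /* only trailing spaces were left */
      for (size_t i = 0; i < table_len; ++i)
        {
          if (strlen (table[i].name) == len
              && strncmp (p, table[i].name, len) == 0)
            {
              out[count++] = table[i]; /* keep the table's version */
              break;
            }
        }
      p += len;
    }
  return count;
}
```

With a table containing `__opencl_c_images` and `__opencl_c_work_group_collective_functions`, the string `"__opencl_c_images __opencl_c_unknown __opencl_c_work_group_collective_functions"` yields two entries.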
4 changes: 2 additions & 2 deletions lib/CL/devices/common_utils.c
@@ -105,9 +105,8 @@ align_ptr (char *p)

#define FALLBACK_MAX_THREAD_COUNT 8

/* initializes CPU-specific device info struct members, that cannot / should
/* Initializes CPU-specific device info defaults that cannot / should
not be initialized in pocl_init_default_device_infos() */

cl_int
pocl_cpu_init_common (cl_device_id device)
{
@@ -139,6 +138,7 @@ pocl_cpu_init_common (cl_device_id device)
device->features = HOST_DEVICE_FEATURES_30;
device->run_program_scope_variables_pass = CL_TRUE;
device->generic_as_support = CL_TRUE;
device->wg_collective_func_support = CL_TRUE;

pocl_setup_opencl_c_with_version (device, CL_TRUE);
pocl_setup_features_with_version (device);
5 changes: 3 additions & 2 deletions lib/CL/devices/cpuinfo.c
@@ -305,15 +305,16 @@ pocl_cpuinfo_get_cpu_name_and_vendor(cl_device_id device)
/* default vendor and vendor_id, in case it cannot be found by other means */
device->vendor = cpuvendor_default;
if (device->vendor_id == 0)
device->vendor_id = CL_KHRONOS_VENDOR_ID_POCL;
device->vendor_id = pocl_get_int_option ("POCL_CPU_VENDOR_ID_OVERRIDE",
CL_KHRONOS_VENDOR_ID_POCL);

/* read contents of /proc/cpuinfo */
if (access (cpuinfo, R_OK) != 0)
return;

FILE *f = fopen (cpuinfo, "r");
char contents[MAX_CPUINFO_SIZE];
int num_read = fread (contents, 1, MAX_CPUINFO_SIZE - 1, f);
int num_read = fread (contents, 1, MAX_CPUINFO_SIZE - 1, f);
fclose(f);
contents[num_read]='\0';

4 changes: 3 additions & 1 deletion lib/CL/devices/devices.c
@@ -669,7 +669,9 @@ pocl_init_devices ()
a shared global memory. */
dev->global_mem_id = dev_index;
POCL_INIT_OBJECT (dev);
dev->driver_version = POCL_VERSION_FULL;
dev->driver_version = pocl_get_string_option (
"POCL_DRIVER_VERSION_OVERRIDE", POCL_VERSION_FULL);

if (dev->version == NULL)
dev->version = "OpenCL 2.0 pocl";

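Both overrides read an environment variable and fall back to the built-in value: `pocl_get_int_option` in `cpuinfo.c` (consulted only while `device->vendor_id` is still 0) and `pocl_get_string_option` in `devices.c`. Their implementations are not part of this diff, so the sketch below approximates the behavior with `getenv` and `strtol`. The function names are hypothetical, and rejecting values with trailing garbage is an assumption of this sketch. Note that the documented vendor ID 32902 is simply the decimal form of 0x8086.

```c
#include <stdlib.h>

/* Parses s as an integer (decimal, hex with a 0x prefix, or octal with a
   leading 0, as strtol base 0 does). Returns def if s is NULL, empty,
   or contains trailing non-numeric characters. */
long
parse_int_or_default (const char *s, long def)
{
  if (s == NULL || *s == '\0')
    return def;
  char *end = NULL;
  long v = strtol (s, &end, 0);
  return (*end == '\0') ? v : def;
}

/* Hypothetical stand-in for an integer option such as
   POCL_CPU_VENDOR_ID_OVERRIDE. */
long
get_int_option_sketch (const char *name, long def)
{
  return parse_int_or_default (getenv (name), def);
}

/* Hypothetical stand-in for a string option such as
   POCL_DRIVER_VERSION_OVERRIDE. */
const char *
get_string_option_sketch (const char *name, const char *def)
{
  const char *s = getenv (name);
  return (s != NULL) ? s : def;
}
```

Following the release notes, a SYCL application (the name `./sycl_app` is a placeholder) would be launched as `POCL_DRIVER_VERSION_OVERRIDE=2023.16.7.0.21_160000 POCL_CPU_VENDOR_ID_OVERRIDE=32902 ./sycl_app`.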
2 changes: 1 addition & 1 deletion lib/CL/devices/remote/remote.c
@@ -686,7 +686,7 @@ setup_relevant_devices (cl_program program, cl_device_id device,
remote_server_data_t *server
= ((remote_device_data_t *)device->data)->server;
unsigned num_relevant_devices = 0;
char program_bc_path[POCL_MAX_FILENAME_LENGTH];
char program_bc_path[POCL_MAX_PATHNAME_LENGTH];
unsigned i, j;

for (i = 0; i < program->num_devices; ++i)
9 changes: 4 additions & 5 deletions lib/CL/pocl_cache.c
@@ -92,12 +92,11 @@ void pocl_cache_program_path(char* path,
program_device_dir (path, program, device_i, "");
}

// required in llvm API
void pocl_cache_program_bc_path(char* program_bc_path,
cl_program program,
unsigned device_i) {
program_device_dir(program_bc_path, program,
device_i, POCL_PROGRAM_BC_FILENAME);
program_device_dir (program_bc_path, program,
device_i, POCL_PROGRAM_BC_FILENAME);
}

void
@@ -208,9 +207,9 @@ pocl_cache_kernel_cachedir (char *kernel_cachedir_path, cl_program program,
{
int bytes_written;
char tempstring[POCL_MAX_PATHNAME_LENGTH];
char file_name[POCL_MAX_DIRNAME_LENGTH + 1];
char file_name[POCL_MAX_FILENAME_LENGTH + 1];

pocl_hash_clipped_name (kernel_name, POCL_MAX_DIRNAME_LENGTH, &file_name[0]);
pocl_hash_clipped_name (kernel_name, POCL_MAX_FILENAME_LENGTH, &file_name[0]);

bytes_written
= snprintf (tempstring, POCL_MAX_PATHNAME_LENGTH, "/%s", file_name);
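The kernel cache directory name is now clipped to `POCL_MAX_FILENAME_LENGTH`, which this PR reduces to 64, via `pocl_hash_clipped_name`, so that long kernel names still produce short directory names. The helper's body is not shown in the diff, so here is a hypothetical sketch of the general idea: keep a prefix of the name and append a hash of the full name so that names sharing a long prefix stay distinct. The FNV-1a hash and the `_` plus 16 hex digits layout are assumptions, not PoCL's actual scheme.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 64-bit FNV-1a hash of a NUL-terminated string. */
static uint64_t
fnv1a_64 (const char *s)
{
  uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
  for (; *s != '\0'; ++s)
    {
      h ^= (unsigned char)*s;
      h *= 1099511628211ULL; /* FNV prime */
    }
  return h;
}

/* Hypothetical helper: writes at most max_len characters plus a NUL into
   out, which must hold max_len + 1 bytes. Names that fit are copied
   unchanged; longer names keep a prefix and end with '_' followed by
   16 hex digits of the full name's hash. Assumes max_len >= 17. */
void
hash_clipped_name_sketch (const char *name, size_t max_len, char *out)
{
  size_t len = strlen (name);
  if (len <= max_len)
    {
      memcpy (out, name, len + 1);
      return;
    }
  size_t prefix = max_len - 17; /* room for '_' + 16 hex digits */
  memcpy (out, name, prefix);
  snprintf (out + prefix, 18, "_%016llx",
            (unsigned long long)fnv1a_64 (name));
}
```

With `max_len` set to 64, a 150-character name becomes a 47-character prefix plus a 17-character hash suffix, exactly 64 characters.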
1 change: 1 addition & 0 deletions lib/CL/pocl_cl.h
@@ -853,6 +853,7 @@ struct _cl_device_id {
size_t preferred_wg_size_multiple;
cl_bool non_uniform_work_group_support;
cl_bool generic_as_support;
cl_bool wg_collective_func_support;
cl_uint preferred_vector_width_char;
cl_uint preferred_vector_width_short;
cl_uint preferred_vector_width_int;
1 change: 1 addition & 0 deletions lib/kernel/host/CMakeLists.txt
@@ -139,6 +139,7 @@ vload_store_half_f16c.c
vstore.cl
vstore_half.cl
wait_group_events.cl
work_group.c
write_image.cl

###################################################################
34 changes: 3 additions & 31 deletions lib/kernel/subgroups.c
@@ -29,37 +29,11 @@

#include <math.h>

/**
* \brief Internal pseudo function which allocates space from the work-group
* thread's stack (basically local memory) for each work-item.
*
* It's expanded in WorkitemLoops.cc to an alloca().
*
* @param element_size The size of an element to allocate (for all WIs in the
* WG).
* @param align The alignment of the start of chunk.
* @param extra_bytes extra bytes to add to the allocation, some functions need
* extra space
* @return pointer to the allocated stack space (freed at unwind).
*/
void *__pocl_work_group_alloca (size_t element_size, size_t align,
size_t extra_bytes);

/**
* \brief Internal pseudo function which allocates space from the work-group
* thread's stack (basically local memory).
*
* It's expanded in WorkitemLoops.cc to an alloca().
*
* @param bytes The size of data to allocate in bytes.
* @param align The alignment of the start of chunk.
* @return pointer to the allocated stack space (freed at unwind).
*/
void *__pocl_local_mem_alloca (size_t bytes, size_t align);

size_t _CL_OVERLOADABLE get_local_size (unsigned int dimindx);
#include "work_group_alloca.h"

size_t _CL_OVERLOADABLE get_local_id (unsigned int dimindx);
size_t _CL_OVERLOADABLE get_local_linear_id (void);
size_t _CL_OVERLOADABLE get_local_size (unsigned int dimindx);

/* Magic variable that is expanded in Workgroup.cc */
extern uint _pocl_sub_group_size;
@@ -89,8 +63,6 @@ get_enqueued_num_sub_groups (void)
return 1;
}

size_t _CL_OVERLOADABLE get_local_linear_id (void);

uint _CL_OVERLOADABLE
get_sub_group_id (void)
{