-
Notifications
You must be signed in to change notification settings - Fork 715
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SYCL] Add spec for sycl_ext_intel_cslice (#7513)
Add a proposed extension specification that allows partitioning a device by "cslice" (aka CCS-es). Co-authored-by: Ben Ashbaugh <ben.ashbaugh@intel.com> Co-authored-by: jbrodman <brodman@gmail.com>
- Loading branch information
1 parent
91b1515
commit 5777e1f
Showing
1 changed file
with
279 additions
and
0 deletions.
There are no files selected for viewing
279 changes: 279 additions & 0 deletions
279
sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,279 @@ | ||
= sycl_ext_intel_cslice | ||
|
||
:source-highlighter: coderay | ||
:coderay-linenums-mode: table | ||
|
||
// This section needs to be after the document title. | ||
:doctype: book | ||
:toc2: | ||
:toc: left | ||
:encoding: utf-8 | ||
:lang: en | ||
:dpcpp: pass:[DPC++] | ||
|
||
// Set the default source code type in this document to C++, | ||
// for syntax highlighting purposes. This is needed because | ||
// docbook uses c++ and html5 uses cpp. | ||
:language: {basebackend@docbook:c++:cpp} | ||
|
||
|
||
== Notice | ||
|
||
[%hardbreaks] | ||
Copyright (C) 2022-2022 Intel Corporation. All rights reserved. | ||
|
||
Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks | ||
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by | ||
permission by Khronos. | ||
|
||
|
||
== Contact | ||
|
||
To report problems with this extension, please open a new issue at: | ||
|
||
https://github.com/intel/llvm/issues | ||
|
||
|
||
== Dependencies | ||
|
||
This extension is written against the SYCL 2020 revision 6 specification. All | ||
references below to the "core SYCL specification" or to section numbers in the | ||
SYCL specification refer to that revision. | ||
|
||
|
||
== Status | ||
|
||
This is a proposed extension specification, intended to gather community | ||
feedback. Interfaces defined in this specification may not be implemented yet | ||
or may be in a preliminary state. The specification itself may also change in | ||
incompatible ways before it is finalized. *Shipping software products should | ||
not rely on APIs defined in this specification.* | ||
|
||
|
||
== Overview | ||
|
||
:multi-CCS: https://github.com/intel/compute-runtime/blob/master/level_zero/doc/experimental_extensions/MULTI_CCS_MODES.md | ||
|
||
Some Intel GPU devices can be partitioned at a granularity of "cslice" (compute | ||
slice), which is a smaller granularity than "tile". This form of partitioning | ||
is not currently enabled by default, so it is considered an advanced feature | ||
which most applications are not expected to use. This extension provides a way | ||
for these advanced applications to partition a device by cslice when it is | ||
enabled in the device driver. | ||
|
||
Unlike "tile" partitions, a cslice partition does not have any different cache | ||
affinity from its sibling cslice partitions. Therefore, this extension does | ||
not expose this type of partitioning through | ||
`info::partition_property::partition_by_affinity_domain`. Instead, it adds a | ||
new partitioning type | ||
`info::partition_property::ext_intel_partition_by_cslice`. | ||
|
||
The only Intel GPU devices that currently support this type of partitioning | ||
are the Data Center GPU Max series (aka PVC), and this support is only | ||
available when the device driver is configured in {multi-CCS}[multi-CCS] mode. | ||
See that documentation for instructions on how to enable this mode and for | ||
other important information. Currently, it is only possible to partition a | ||
device by cslice if the driver is in "2 CCS Mode" or "4 CCS Mode". When in | ||
2 CCS Mode, a tile can be partitioned into two cslice sub-devices. When in | ||
4 CCS Mode, a tile can be partitioned into four cslice sub-devices. | ||
|
||
This type of partitioning is currently supported only at the "tile" level. | ||
A device should be queried using `info::device::partition_properties` to | ||
determine if it supports partitioning by `ext_intel_partition_by_cslice`. If a | ||
device does not support partitioning by `ext_intel_partition_by_cslice` it may | ||
first need to be partitioned into per-tile sub-devices via | ||
`partition_by_affinity_domain`, and then each of the resulting sub-devices may | ||
be further partitioned by `ext_intel_partition_by_cslice`. | ||
|
||
It is important to understand that the device driver virtualizes work | ||
submission to the cslice sub-devices. (More specifically, the device driver | ||
virtualizes work submission to different CCS-es, and this means that on Data | ||
Center GPU Max series devices the work submission to a cslice is virtualized.) | ||
This virtualization happens only between processes, and not within a single | ||
process. For example, consider a single process that constructs two SYCL | ||
queues on cslice sub-device #0. Kernels submitted to these two queues are | ||
guaranteed to conflict, both using the same set of execution units. Therefore, | ||
if a single process wants to explicitly submit kernels to cslice sub-devices | ||
and it wants to avoid conflict, it should create queues on different | ||
sub-devices. By contrast, consider an example where two separate processes | ||
create a SYCL queue on cslice sub-device #0. In this case, the device driver | ||
virtualizes access to this cslice, and kernels submitted from the first process | ||
may run on different execution units than kernels submitted from the second | ||
process. In this second case, the device driver binds the process's requested | ||
cslice to a physical cslice according to the overall system load. | ||
|
||
Note that this extension can be supported by any implementation. If an | ||
implementation supports a backend or device without the concept of cslice | ||
partitions it can still conform to this extension by declaring the new | ||
enumerator and member functions specified below. If the info descriptor query | ||
`info::device::partition_properties` does not report | ||
`ext_intel_partition_by_cslice`, then the backend or device does not support | ||
the creation of cslice partitions. | ||
|
||
|
||
== Specification | ||
|
||
=== Feature test macro | ||
|
||
This extension provides a feature-test macro as described in the core SYCL | ||
specification. An implementation supporting this extension must predefine the | ||
macro `SYCL_EXT_INTEL_CSLICE` to one of the values defined in the table | ||
below. Applications can test for the existence of this macro to determine if | ||
the implementation supports this feature, or applications can test the macro's | ||
value to determine which of the extension's features the implementation | ||
supports. | ||
|
||
[%header,cols="1,5"] | ||
|=== | ||
|Value | ||
|Description | ||
|
||
|1 | ||
|Initial version of this extension. | ||
|=== | ||
|
||
=== New partition property | ||
|
||
This extension adds a new enumerator named `ext_intel_partition_by_cslice` to | ||
`info::partition_property`: | ||
|
||
``` | ||
namespace sycl::info { | ||
|
||
enum class partition_property : /* unspecified */ { | ||
// ... | ||
ext_intel_partition_by_cslice | ||
}; | ||
|
||
} // namespace sycl::info | ||
``` | ||
|
||
The behavior of the `info::device::partition_properties` info descriptor query | ||
is also extended to include `ext_intel_partition_by_cslice` in the vector of | ||
returned values if the device can be partitioned into at least two sub-devices | ||
along that partition property. | ||
|
||
=== New function template specialization to create sub-devices | ||
|
||
This extension adds a new function template specialization to the `device` | ||
class: | ||
|
||
``` | ||
namespace sycl { | ||
|
||
class device { | ||
// ... | ||
|
||
// Available only when | ||
// Prop == info::partition_property::ext_intel_partition_by_cslice | ||
template <info::partition_property Prop> | ||
std::vector<device> create_sub_devices() const; | ||
}; | ||
|
||
} // namespace sycl | ||
``` | ||
|
||
This function only participates in overload resolution if the `Prop` template | ||
parameter is `info::partition_property::ext_intel_partition_by_cslice`. It | ||
returns a `std::vector` of sub-devices partitioned from this SYCL `device`, | ||
each representing a fixed set of hardware cslices. | ||
|
||
If the SYCL `device` does not support | ||
`info::partition_property::ext_intel_partition_by_cslice`, calling this | ||
function throws a synchronous `exception` with the | ||
`errc::feature_not_supported` error code. | ||
|
||
=== Behavior of device info queries for a "cslice" sub-device | ||
|
||
This section describes the behavior for some of the device info queries when | ||
applied to a `device` object that represents a "cslice" partition. | ||
|
||
* `info::device::partition_type_property` | ||
+ | ||
Returns `ext_intel_partition_by_cslice`. | ||
|
||
* `info::device::max_compute_units` | ||
+ | ||
When partitioning by `ext_intel_partition_by_cslice`, each sub-device | ||
represents a fixed subset of the parent device's compute units. This query | ||
returns the number of compute units represented by the sub-device. | ||
|
||
The remaining device info queries return the properties or limits of the | ||
sub-device, as is typical for these queries. In general, if a resource is | ||
partitioned among the sub-devices, then the associated info query will | ||
return each sub-device's share of the resource. However, if a resource is | ||
shared by all of the sub-devices, then the associated info query for each | ||
sub-device will return the same value as for the parent device. For example, | ||
if device global memory is shared by all cslice partitions in a tile, then the | ||
info query `info::device::global_mem_size` will return the same value for the | ||
`device` object representing the tile as for the `device` object representing | ||
a cslice. | ||
|
||
=== Behavior of the Level Zero backend interop functions | ||
|
||
The Level Zero device driver doesn't use the concept of sub-device to represent | ||
a fixed partition of cslices. Instead, a Level Zero command queue can be | ||
created with a particular queue index, which represents a partition of the | ||
cslices. | ||
|
||
As a result, calling `get_native` for a SYCL `device` that represents a cslice | ||
partition returns the same `ze_device_handle_t` as the parent device. If an | ||
application wants a native handle to the cslice partition, it must create a | ||
SYCL `queue` and then call `get_native` on the `queue`. This will return a | ||
`ze_command_queue_handle_t` that corresponds to the cslice partition. | ||
|
||
=== Behavior of the OpenCL backend interop functions | ||
|
||
The OpenCL device driver doesn't use the concept of sub-device to represent a | ||
fixed partition of cslices. Instead, an OpenCL command queue can be created | ||
with a particular queue index, which represents a partition of the cslices. | ||
|
||
As a result, calling `get_native` for a SYCL `device` that represents a cslice | ||
partition returns the same `cl_device_id` as the parent device. If an | ||
application wants a native handle to the cslice partition, it must create a | ||
SYCL `queue` and then call `get_native` on the `queue`. This will return a | ||
`cl_command_queue` that corresponds to the cslice partition. | ||
|
||
|
||
== Impact on the ONEAPI_DEVICE_SELECTOR environment variable | ||
|
||
:oneapi-device-selector: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#oneapi_device_selector | ||
|
||
This section describes the effect of this extension on the {dpcpp} | ||
`ONEAPI_DEVICE_SELECTOR` environment variable. Since this environment variable | ||
is not part of the SYCL specification, this section is not a normative part of | ||
the extension specification. Rather, it only describes the impact on {dpcpp}. | ||
|
||
As described in the {oneapi-device-selector}[documentation] for the | ||
`ONEAPI_DEVICE_SELECTOR`, a term in the selector string can be an integral | ||
number followed by a decimal point (`.`), where the decimal point indicates a | ||
sub-device. For example, `1.2` means sub-device #2 of device #1. These | ||
decimal points can represent either a sub-device created via | ||
`partition_by_affinity_domain` or via `ext_intel_partition_by_cslice`. When | ||
{dpcpp} processes a term with a decimal point, it first attempts to partition | ||
by `ext_intel_partition_by_cslice`. If that is not possible, it next attempts | ||
to partition by `partition_by_affinity_domain` / | ||
`partition_affinity_domain::next_partitionable`. | ||
|
||
It is important to keep in mind, though, that requesting a specific cslice via | ||
this environment variable has limited effect due to the device driver's | ||
virtualization of cslices. To illustrate, consider an example where two | ||
processes are launched as follows, selecting different cslice sub-devices: | ||
|
||
``` | ||
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo | ||
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.1 ZEX_NUMBER_OF_CCS=0:2 ./foo | ||
``` | ||
|
||
The first process selects cslice #0 while the second selects cslice #1. This | ||
does have the effect that each process is constrained to a single cslice (which | ||
is not the {dpcpp} default). However, the actual cslice number is irrelevant. | ||
Because of cslice virtualization, the device driver will choose some available | ||
cslice for each process instead of honoring the value requested in the | ||
environment variable. As a result, the following example has exactly the same | ||
effect: | ||
|
||
``` | ||
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo | ||
$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo | ||
``` |