
[SYCL] Fix event profiling for command_submit in L0 and other backends #7526

Merged: 62 commits into intel:sycl, Jan 10, 2023

Conversation

raaiq1 (Contributor) commented Nov 24, 2022

According to the SYCL 2020 specification, the timeframe for calculating command submission time is:

> ... always some time after the command group function object returns and before the associated call to queue::submit returns.

Currently, command submission time is recorded when a command is submitted to the underlying device, which is not necessarily before queue::submit returns (e.g., a host_accessor can block command submission until it is destroyed).
This patch changes that timeframe so that the submission time is always recorded before queue::submit returns, specifically right after the command group is persisted by the graph builder and before it is enqueued by the graph processor.
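For context, a minimal usage sketch (not part of this patch) of what the corrected timestamp lets an application measure; the trivial single_task kernel and the output format are only illustrative:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Profiling must be requested explicitly on the queue.
  sycl::queue Q{sycl::property::queue::enable_profiling{}};

  sycl::event E = Q.submit([&](sycl::handler &CGH) {
    CGH.single_task([] { /* trivial kernel */ });
  });
  E.wait();

  // Per SYCL 2020 Table 37, command_submit is recorded before queue::submit
  // returns, so the difference below is the queueing delay before execution.
  auto Submit = E.get_profiling_info<sycl::info::event_profiling::command_submit>();
  auto Start = E.get_profiling_info<sycl::info::event_profiling::command_start>();
  std::cout << "submit-to-start delay: " << (Start - Submit) << " ns\n";
}
```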

Signed-off-by: Rauf, Rana <rana.rauf@intel.com>
raaiq1 (Contributor Author) commented Nov 29, 2022

@smaslov-intel For querying the device time at the plugin layer, is it justified to have a separate method (piDeviceGetCurrentTime), or should it be an extension to piDeviceGetInfo?

smaslov-intel (Contributor) commented:

> @smaslov-intel For querying the device time at the plugin layer, is it justified to have a separate method (piDeviceGetCurrentTime), or should it be an extension to piDeviceGetInfo?

Please explain the entire design you have in mind so we can make a call. Why is the new query needed, and how will it be used for obtaining the "command_submit" time from the Level Zero backend? What about other backends?

raaiq1 (Contributor Author) commented Nov 30, 2022

So the main idea is to query the device timestamp right after a command group is submitted to the scheduler or to L0, and use that as the submission time; currently there is no existing function in the PI API to get that timestamp.

In detail, I would like to add a getDeviceCurrentTime function to the device_impl class, which would query the current time only for the Level Zero backend (support for other backends would be enabled in a future PR), returning 0 otherwise. This function would simply call a PI function (piDeviceGetCurrentTime or similar) to retrieve the current device timestamp. It would be invoked in handler::finalize() right before the method returns (here and here), so the submission time would be recorded only if the submission was successful (no exceptions thrown).
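A rough sketch of that flow, to make the proposal concrete. The names below (the PI entry point, the helpers, the stand-in pi_device type) follow this comment rather than the actual intel/llvm sources, so treat it as an illustration, not the final implementation:

```cpp
#include <cstdint>

// Hypothetical PI entry point proposed above (declaration only); initially only
// the Level Zero plugin would implement it.
using pi_device = void *; // stand-in for the real PI handle type from pi.h
int piDeviceGetCurrentTime(pi_device Device, std::uint64_t *DeviceTime);

// Sketch of device_impl::getDeviceCurrentTime(): return the device clock for
// Level Zero, and 0 for backends that are not supported yet.
std::uint64_t getDeviceCurrentTime(pi_device Device, bool IsLevelZeroBackend) {
  if (!IsLevelZeroBackend)
    return 0; // other backends: to be enabled in a future PR
  std::uint64_t DeviceTime = 0;
  piDeviceGetCurrentTime(Device, &DeviceTime);
  return DeviceTime;
}

// Sketch of the call site at the end of handler::finalize(): the timestamp is
// captured right before returning, so it is only recorded when the submission
// succeeded (no exception was thrown earlier in the function).
std::uint64_t finalizeAndRecordSubmitTime(pi_device Device, bool IsLevelZeroBackend) {
  // ... persist the command group and hand it to the scheduler / L0 ...
  return getDeviceCurrentTime(Device, IsLevelZeroBackend);
}
```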

raaiq1 (Contributor Author) commented Nov 30, 2022

> What about other backends?

I plan to use a similar approach for getting the command_submit time for the other backends.

smaslov-intel (Contributor) commented:

> So the main idea is to query the device timestamp right after a command group is submitted to the scheduler or to L0, and use that as the submission time; currently there is no existing function in the PI API to get that timestamp.
>
> In detail, I would like to add a getDeviceCurrentTime function to the device_impl class, which would query the current time only for the Level Zero backend (support for other backends would be enabled in a future PR), returning 0 otherwise. This function would simply call a PI function (piDeviceGetCurrentTime or similar) to retrieve the current device timestamp. It would be invoked in handler::finalize() right before the method returns (here and here), so the submission time would be recorded only if the submission was successful (no exceptions thrown).

Why do you want to expose the "current" device timestamp to SYCL (via PI)? Why not keep it internal to the L0 plugin, and record it whenever needed (either when a command is put into a batch, or when that batch is submitted)?

raaiq1 (Contributor Author) commented Nov 30, 2022

> Why do you want to expose the "current" device timestamp to SYCL (via PI)? Why not keep it internal to the L0 plugin, and record it whenever needed (either when a command is put into a batch, or when that batch is submitted)?

In the edge case Vlad Romanov presented:

accessor HostAcc = Buf.get_access();
auto e = Q.submit(/* a kernel writing to Buf */); // Command1
std::cout << e.get_profiling_info<info::event_profiling::command_submit>(); // line 3
HostAcc.~accessor();
e.wait(); // line 5

The host accessor basically blocks enqueuing of the Command1 kernel (done through piEnqueueKernelLaunch) until the accessor is destroyed. So in this case, the kernel is not submitted to L0 until line 5, which is after the queue::submit() call returns and therefore not in accordance with the specification.

smaslov-intel (Contributor) commented:

> So in this case, the kernel is not submitted to L0 until line 5, which is after the queue::submit() call returns and therefore not in accordance with the specification.

What exactly in the spec is violated?

raaiq1 (Contributor Author) commented Nov 30, 2022

> What exactly in the spec is violated?

Under Table 37 of SYCL 2020, this is the description of info::event_profiling::command_submit:

> Returns a timestamp telling when the associated command group was submitted to the queue. This is always some time after the command group function object returns and before the associated call to queue::submit returns.

Edit: Sorry for being vague. The command_submit time must be recorded before the submit function returns, whereas in the host_accessor case it would be recorded after the submit method returns with the L0 approach.

smaslov-intel (Contributor) commented:

> What exactly in the spec is violated?
>
> Under Table 37 of SYCL 2020, this is the description of info::event_profiling::command_submit:
>
> Returns a timestamp telling when the associated command group was submitted to the queue. This is always some time after the command group function object returns and before the associated call to queue::submit returns.
>
> Edit: Sorry for being vague. The command_submit time must be recorded before the submit function returns, whereas in the host_accessor case it would be recorded after the submit method returns with the L0 approach.

So this essentially requires recording the time of the queue::submit() call and not the time of the "command" submit to the execution queue. This sounds odd to me, especially since the timestamp is a device time and the device would not even be involved at that point.

I think this is related to another ongoing discussion with @gmlueck, where we admit that SYCL queue::submit() doesn't guarantee that commands are physically submitted, so what is the reason to require that the command_submit timestamp is recorded before the submit function returns?

@gmlueck: can you add your perspective, please?

gmlueck (Contributor) commented Dec 2, 2022

It seems natural for the user to want to know the delay between when a command is submitted and when that command starts executing on hardware. The command_start timestamp tells when it starts executing, so we need command_submit to allow applications to compute the delay. Both timestamps must have the same "timebase", so that it is valid for applications to compute the difference.

We had discussions on how this could be implemented when we clarified this part of the spec. Here are the relevant notes:

> Adding some notes here about how these profiling timestamps could be implemented on OpenCL. As noted above, info::event_profiling::command_start corresponds to CL_PROFILING_COMMAND_START and info::event_profiling::command_end corresponds to CL_PROFILING_COMMAND_END.
>
> Implementing info::event_profiling::command_submit requires some way to get a timestamp from host code, but using the device clock. (More precisely, the timestamp must have the same "timebase" as the timestamps used for info::event_profiling::command_start and info::event_profiling::command_end.) On OpenCL, this can be achieved by using clGetDeviceAndHostTimer and clGetHostTimer.
>
> The implementation needs to call clGetDeviceAndHostTimer just once for each device. This returns a synchronized pair of timestamps: one from the host clock and one from the device clock. Subtracting the two gives the delta between the two clocks. Then, the implementation needs to call clGetHostTimer each time a command is put on the SYCL queue. This retrieves the timestamp using the "host clock". The implementation can then add the delta (computed earlier) to convert this timestamp to the "device clock", and this is the timestamp to return for info::event_profiling::command_submit.
>
> Note that these two APIs (clGetDeviceAndHostTimer and clGetHostTimer) were added in OpenCL 2.1, so it seems likely that this part of the SYCL API could not be implemented on a pure OpenCL 1.2 implementation. This does not seem problematic, though, especially since the SYCL profiling APIs are optional for each device.

When I wrote these notes, I recall that the same strategy would work for Level Zero, so I think there are similar Level Zero APIs to the OpenCL ones I list above.
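A compact sketch of the algorithm from these notes, in terms of the OpenCL 2.1 host APIs; error handling is omitted and the helper names are illustrative:

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

// Call once per device: returns (device clock - host clock) from a synchronized
// timestamp pair, so later host timestamps can be shifted into the device timebase.
cl_long computeClockDelta(cl_device_id Device) {
  cl_ulong DeviceTime = 0, HostTime = 0;
  clGetDeviceAndHostTimer(Device, &DeviceTime, &HostTime);
  return static_cast<cl_long>(DeviceTime) - static_cast<cl_long>(HostTime);
}

// Call on every queue::submit(): a cheap host-clock read, converted into the
// device timebase so it can be compared with command_start / command_end.
cl_ulong commandSubmitTimestamp(cl_device_id Device, cl_long ClockDelta) {
  cl_ulong HostTime = 0;
  clGetHostTimer(Device, &HostTime);
  return HostTime + ClockDelta;
}
```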

smaslov-intel (Contributor) commented:

> It seems natural for the user to want to know the delay between when a command is submitted and when that command starts executing on hardware.

Agreed. The remaining question is when to capture the "submit" time. Is it OK to do it at the time when piEnqueue* is called? (I'd like to avoid a PI API extension just for capturing the device's time.)

gmlueck (Contributor) commented Dec 2, 2022

> The remaining question is when to capture the "submit" time. Is it OK to do it at the time when piEnqueue* is called? (I'd like to avoid a PI API extension just for capturing the device's time.)

The SYCL spec says:

> This is always some time after the command group function object returns and before the associated call to queue::submit returns.

Does that correspond to when piEnqueue* is called?

raaiq1 (Contributor Author) commented Dec 2, 2022

> Does that correspond to when piEnqueue* is called?

There is an edge case where that's not necessarily true. If a host accessor is constructed before submitting a kernel (through queue::submit()) with the same memory dependency, then the piEnqueue call for that kernel is deferred until the host accessor is destroyed.

gmlueck (Contributor) commented Dec 2, 2022

> There is an edge case where that's not necessarily true. If a host accessor is constructed before submitting a kernel (through queue::submit()) with the same memory dependency, then the piEnqueue call for that kernel is deferred until the host accessor is destroyed.

I think it would make sense to capture the timestamp at the same point that the spec says even in this case. Is it hard to add a new PI API to capture the timestamp?

smaslov-intel (Contributor) commented:

> There is an edge case where that's not necessarily true. If a host accessor is constructed before submitting a kernel (through queue::submit()) with the same memory dependency, then the piEnqueue call for that kernel is deferred until the host accessor is destroyed.
>
> I think it would make sense to capture the timestamp at the same point that the spec says even in this case. Is it hard to add a new PI API to capture the timestamp?

Technically, it is not hard, but it should be coordinated/approved for the Unified Runtime API. It would probably just be a new info for the existing piDeviceGetInfo. Then again, isn't it more useful to end users to record the time when the kernel was actually submitted by the host (when piEnqueue* is called)?

gmlueck (Contributor) commented Dec 2, 2022

It seems to me that the dependency on the host_accessor is no different than a dependency on a buffer through a regular accessor. In both cases, the kernel cannot be executed until the memory is available. Why should we treat the host_accessor case differently from the regular accessor case?

smaslov-intel (Contributor) commented:

> There is an edge case where that's not necessarily true. If a host accessor is constructed before submitting a kernel (through queue::submit()) with the same memory dependency, then the piEnqueue call for that kernel is deferred until the host accessor is destroyed.
>
> I think it would make sense to capture the timestamp at the same point that the spec says even in this case. Is it hard to add a new PI API to capture the timestamp?
>
> Technically, it is not hard, but it should be coordinated/approved for the Unified Runtime API. It would probably just be a new info for the existing piDeviceGetInfo. Then again, isn't it more useful to end users to record the time when the kernel was actually submitted by the host (when piEnqueue* is called)?

@raaiq1: At this time I am not opposed to the PI API extension, but we will also need to get it into Unified Runtime; tagging @kbenzie.

kbenzie (Contributor) commented Dec 6, 2022

This looks a lot like clGetDeviceAndHostTimer. Should we use an analogue of this instead?

raaiq1 (Contributor Author) commented Dec 6, 2022

I'm okay with using an analogue of clGetDeviceAndHostTimer; maybe have the PI API analogue be piGetDeviceAndHostTime.
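For reference, a hypothetical sketch of a PI analogue mirroring the OpenCL signature (the exact name, piGetDeviceAndHostTime vs. piGetDeviceAndHostTimer, is settled further down this thread), as it might appear in pi.h:

```cpp
// Hypothetical PI declaration mirroring clGetDeviceAndHostTimer: returns a
// synchronized pair of device and host timestamps for the given device.
pi_result piGetDeviceAndHostTimer(pi_device Device,
                                  uint64_t *DeviceTime,
                                  uint64_t *HostTime);
```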

kbenzie (Contributor) commented Dec 6, 2022

> I'm okay with using an analogue of clGetDeviceAndHostTimer; maybe have the PI API analogue be piGetDeviceAndHostTime.

Apologies, I should have expanded. This was a question about using an analogue of clGetDeviceAndHostTimer in Unified Runtime. Sounds like that would be acceptable; if so, I'll create a ticket to get that added to the UR spec.

Ideally, we would have the PI and UR entry points match, but I don't think it's a strict requirement.

smaslov-intel (Contributor) commented:

> Ideally, we would have the PI and UR entry points match, but I don't think it's a strict requirement.

Please have them match, to avoid redundant friction in the SYCL RT from having to deal with both paths for some (long) time.

kbenzie (Contributor) commented Dec 6, 2022

@smaslov-intel that works for me 👍

In that case, yes, piGetDeviceAndHostTimer being the analogue would be my preference, to better align with OpenCL where we can.

gmlueck (Contributor) commented Dec 6, 2022

OpenCL has two different functions clGetDeviceAndHostTimer and clGetHostTimer. Shouldn't we add PI / UR interfaces for each?

kbenzie (Contributor) commented Dec 6, 2022

Perhaps, although I'm only aware of a direct use case for the first at this time.

If you decide to also add a clGetHostTimer analogue, please update oneapi-src/unified-runtime#88 with the request 🙂

gmlueck (Contributor) commented Dec 6, 2022

> Perhaps, although I'm only aware of a direct use case for the first at this time.

The algorithm I propose in the comment above would use them both. Is the code using that algorithm?

raaiq1 (Contributor Author) commented Dec 6, 2022

> The algorithm I propose in #7526 (comment) would use them both. Is the code using that algorithm?

The algorithm is a little more straightforward: submitting a task to the SYCL queue just calls clGetDeviceAndHostTimer directly each time. I couldn't observe much of a performance difference compared to the suggested method, but I'm open to using it.
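For comparison, a sketch of this more direct approach (a synchronized device/host read on every submission rather than a cached delta), again assuming the OpenCL 2.1 APIs; the helper name is illustrative:

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

// One synchronized device/host query per submission; the returned value is
// already in the device timebase, at the cost of a potentially higher-latency
// clGetDeviceAndHostTimer call each time.
cl_ulong commandSubmitTimestampDirect(cl_device_id Device) {
  cl_ulong DeviceTime = 0, HostTime = 0;
  clGetDeviceAndHostTimer(Device, &DeviceTime, &HostTime);
  return DeviceTime;
}
```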

gmlueck (Contributor) commented Dec 6, 2022

> I couldn't observe much of a performance difference compared to the suggested method (#7526 (comment)), but I'm open to using it.

FWIW, OpenCL documents that clGetDeviceAndHostTimer may have high latency, which was why I suggested calling it only once:

> Implementations may need to execute this query with a high latency in order to provide reasonable synchronization of the timestamps.

I'm not sure whether our implementation really does have higher latency, though.

raaiq1 changed the title from "[SYCL] Implement command_submit L0" to "[SYCL] Fix event profiling for command_submit in L0 and other backends" on Dec 22, 2022.
@@ -5988,7 +5988,6 @@ pi_result piEventGetProfilingInfo(pi_event Event, pi_profiling_info ParamName,
}
case PI_PROFILING_INFO_COMMAND_QUEUED:
case PI_PROFILING_INFO_COMMAND_SUBMIT:
// TODO: Support these when Level Zero supported is added.
return ReturnValue(uint64_t{0});
Review comment on the diff above (Contributor):

Returning "0" is still not the right behavior. We should use zeDeviceGetGlobalTimestamps to record the time when commands are physically submitted to the device (inside the plugin). I am OK if you just add a TODO comment and don't fix it in this PR, since that value will still be unused (by the way, also mention that there are currently no users of this).
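A sketch of what recording that inside the Level Zero plugin might look like with zeDeviceGetGlobalTimestamps; this is not necessarily how the plugin ends up implementing it, and error handling plus the tick-to-nanosecond conversion are omitted:

```cpp
#include <level_zero/ze_api.h>

// Sketch: capture a synchronized host/device timestamp pair at the point where a
// command is physically submitted to the device. The raw device value still has
// to be masked by timestampValidBits and scaled by timerResolution from
// ze_device_properties_t before it can be compared in nanoseconds; that
// conversion (and error checking) is omitted here.
void recordSubmitTimestamps(ze_device_handle_t Device,
                            uint64_t &HostTimestamp,
                            uint64_t &DeviceTimestamp) {
  zeDeviceGetGlobalTimestamps(Device, &HostTimestamp, &DeviceTimestamp);
}
```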

mdtoguchi removed the request for review from a team on December 22, 2022 19:20.
raaiq1 (Contributor Author) commented Dec 22, 2022

ping @intel/llvm-gatekeepers to merge

raaiq1 (Contributor Author) commented Dec 22, 2022

ping @intel/llvm-reviewers-runtime, @intel/llvm-reviewers-cuda and @intel/dpcpp-esimd-reviewers for review

AlexeySachkov removed the request for review from a team on December 23, 2022 09:11.
smaslov-intel (Contributor) commented:

@againull: please review/merge this

againull self-requested a review on January 3, 2023 19:08.
bader (Contributor) commented Jan 5, 2023

ping @intel/llvm-reviewers-cuda and @intel/dpcpp-esimd-reviewers for review

againull (Contributor) commented:

ESIMD changes are trivial, so merging this PR.

againull merged commit 71d7797 into intel:sycl on Jan 10, 2023.
npmiller added a commit to npmiller/llvm that referenced this pull request Feb 6, 2023
This patch moves the CUDA context from the PI context to the PI device,
and switches to always using the primary context.

CUDA contexts are different from SYCL contexts in that they're tied to a
single device, and that they are required to be active on a thread for
most calls to the CUDA driver API.

As shown in intel#8124 and intel#7526, the current mapping of
CUDA context to PI context causes issues for device-based entry points
that still need to call the CUDA APIs. We have workarounds to solve
that, but they're a bit hacky, inefficient, and have a lot of edge-case
issues.

The peer-to-peer interface proposal in intel#6104 is also device-based,
but enabling peer-to-peer for CUDA is done on the CUDA contexts,
so the current mapping would make it difficult to implement.

So this patch solves most of these issues by decoupling the CUDA context
from the SYCL context and simply managing the CUDA contexts in the
devices. It also changes the CUDA context management to always use the
primary context.

This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device-based entry points
* Likely more efficient in the general case, with fewer opportunities to
  accidentally cause costly CUDA context switches.
* Easier and likely more efficient interactions with CUDA runtime
  applications.
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context

It does have a few drawbacks compared to the previous approach:

* Drops support for `make_context` interop; there is no sensible "native
  handle" to pass in (`get_native` is still supported).
* No opportunity for users to separate their work into different CUDA
  contexts. It's unclear if there's any actual use case for this; it
  seems very uncommon in CUDA codebases to have multiple CUDA contexts
  for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we
could revisit if we run into an edge case that would need more
fine-grained CUDA context management.
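For illustration, a minimal sketch of the primary-context pattern this commit message describes, using the CUDA driver API; error handling and the actual PI plumbing are omitted, and the helper names are made up:

```cpp
#include <cuda.h>

// Retain the device's primary context once per device and make it current on
// the calling thread before driver API calls; the primary context is the one
// shared with the CUDA runtime API, which is what eases interop with runtime apps.
CUcontext retainPrimaryContext(CUdevice Device) {
  CUcontext Context = nullptr;
  cuDevicePrimaryCtxRetain(&Context, Device);
  cuCtxSetCurrent(Context);
  return Context;
}

// Matching release when the device is torn down.
void releasePrimaryContext(CUdevice Device) {
  cuDevicePrimaryCtxRelease(Device);
}
```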
bader pushed a commit that referenced this pull request Feb 9, 2023