
[SYCL] Fix event profiling for command_submit in L0 and other backends #7526

Merged: 62 commits into intel:sycl, Jan 10, 2023

Conversation

raaiq1 (Contributor) commented Nov 24, 2022

According to the SYCL 2020 specification, the timeframe for calculating command submission time is:

> ... always some time after the command group function object returns and before the associated call to queue::submit returns.

Currently, command submission time is recorded when a command is submitted to the underlying device, which is not necessarily before queue::submit returns (e.g., a host_accessor can block command submission until it is destroyed).
This patch changes that timeframe so that the submission time is always recorded before queue::submit returns, specifically right after the command group is persisted by the graph builder and before it is enqueued by the graph processor.
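For context, a minimal usage sketch (not part of this patch) of what the corrected timestamp lets an application measure; the trivial single_task kernel and the output format are only illustrative:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Profiling must be requested explicitly on the queue.
  sycl::queue Q{sycl::property::queue::enable_profiling{}};

  sycl::event E = Q.submit([&](sycl::handler &CGH) {
    CGH.single_task([] { /* trivial kernel */ });
  });
  E.wait();

  // Per SYCL 2020 Table 37, command_submit is recorded before queue::submit
  // returns, so the difference below is the queueing delay before execution.
  auto Submit = E.get_profiling_info<sycl::info::event_profiling::command_submit>();
  auto Start = E.get_profiling_info<sycl::info::event_profiling::command_start>();
  std::cout << "submit-to-start delay: " << (Start - Submit) << " ns\n";
}
```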

Signed-off-by: Rauf, Rana <rana.rauf@intel.com>
raaiq1 (Contributor Author) commented Nov 29, 2022

@smaslov-intel For querying the device time at the plugin layer, is it justified to have a separate method (piDeviceGetCurrentTime), or should it be an extension to piDeviceGetInfo?

smaslov-intel (Contributor) commented:

> @smaslov-intel For querying the device time at the plugin layer, is it justified to have a separate method (piDeviceGetCurrentTime), or should it be an extension to piDeviceGetInfo?

Please explain the entire design you have in mind so we can make a call. Why is the new query needed, and how will it be used for obtaining the "command_submit" time from the Level Zero backend? What about other backends?

raaiq1 (Contributor Author) commented Nov 30, 2022

So the main idea is to query the device timestamp right after a command group is submitted to the scheduler or to L0, and use that as the submission time; currently there is no existing function in the PI API to get that timestamp.

In detail, I would like to add a getDeviceCurrentTime function to the device_impl class, which would query the current time only for the Level Zero backend (support for other backends would be enabled in a future PR), returning 0 otherwise. This function would simply call a PI function (piDeviceGetCurrentTime or similar) to retrieve the current device timestamp. It would be invoked in handler::finalize() right before the method returns (here and here), so the submission time would be recorded only if the submission was successful (no exceptions thrown).
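A rough sketch of that flow, to make the proposal concrete. The names below (the PI entry point, the helpers, the stand-in pi_device type) follow this comment rather than the actual intel/llvm sources, so treat it as an illustration, not the final implementation:

```cpp
#include <cstdint>

// Hypothetical PI entry point proposed above (declaration only); initially only
// the Level Zero plugin would implement it.
using pi_device = void *; // stand-in for the real PI handle type from pi.h
int piDeviceGetCurrentTime(pi_device Device, std::uint64_t *DeviceTime);

// Sketch of device_impl::getDeviceCurrentTime(): return the device clock for
// Level Zero, and 0 for backends that are not supported yet.
std::uint64_t getDeviceCurrentTime(pi_device Device, bool IsLevelZeroBackend) {
  if (!IsLevelZeroBackend)
    return 0; // other backends: to be enabled in a future PR
  std::uint64_t DeviceTime = 0;
  piDeviceGetCurrentTime(Device, &DeviceTime);
  return DeviceTime;
}

// Sketch of the call site at the end of handler::finalize(): the timestamp is
// captured right before returning, so it is only recorded when the submission
// succeeded (no exception was thrown earlier in the function).
std::uint64_t finalizeAndRecordSubmitTime(pi_device Device, bool IsLevelZeroBackend) {
  // ... persist the command group and hand it to the scheduler / L0 ...
  return getDeviceCurrentTime(Device, IsLevelZeroBackend);
}
```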

raaiq1 (Contributor Author) commented Nov 30, 2022

> What about other backends?

I plan to use a similar approach for getting the command_submit time for the other backends.

smaslov-intel (Contributor) commented:

> So the main idea is to query the device timestamp right after a command group is submitted to the scheduler or to L0, and use that as the submission time; currently there is no existing function in the PI API to get that timestamp.
>
> In detail, I would like to add a getDeviceCurrentTime function to the device_impl class, which would query the current time only for the Level Zero backend (support for other backends would be enabled in a future PR), returning 0 otherwise. This function would simply call a PI function (piDeviceGetCurrentTime or similar) to retrieve the current device timestamp. It would be invoked in handler::finalize() right before the method returns (here and here), so the submission time would be recorded only if the submission was successful (no exceptions thrown).

Why do you want to expose the "current" device timestamp to SYCL (via PI)? Why not keep it internal to the L0 plugin, and record it whenever needed (either when a command is put into a batch, or when that batch is submitted)?

raaiq1 (Contributor Author) commented Nov 30, 2022

> Why do you want to expose the "current" device timestamp to SYCL (via PI)? Why not keep it internal to the L0 plugin, and record it whenever needed (either when a command is put into a batch, or when that batch is submitted)?

In the edge case Vlad Romanov presented:

accessor HostAcc = Buf.get_access();
auto e = Q.submit(/* a kernel writing to Buf */); // Command1
std::cout << e.get_profiling_info<info::event_profiling::command_submit>(); // line 3
HostAcc.~accessor();
e.wait(); // line 5

The host accessor basically blocks enqueuing of the Command1 kernel (done through piEnqueueKernelLaunch) until the accessor is destroyed. So in this case, the kernel is not submitted to L0 until line 5, which is after the queue::submit() call returns and therefore not in accordance with the specification.

smaslov-intel (Contributor) commented:

> So in this case, the kernel is not submitted to L0 until line 5, which is after the queue::submit() call returns and therefore not in accordance with the specification.

What exactly in the spec is violated?

raaiq1 (Contributor Author) commented Nov 30, 2022

> What exactly in the spec is violated?

Under Table 37 of SYCL 2020, this is the description of info::event_profiling::command_submit:

> Returns a timestamp telling when the associated command group was submitted to the queue. This is always some time after the command group function object returns and before the associated call to queue::submit returns.

Edit: Sorry for being vague. The command_submit time must be recorded before the submit function returns, whereas in the host_accessor case it would be recorded after the submit method returns with the L0 approach.

smaslov-intel (Contributor) commented:

> What exactly in the spec is violated?
>
> Under Table 37 of SYCL 2020, this is the description of info::event_profiling::command_submit:
>
> Returns a timestamp telling when the associated command group was submitted to the queue. This is always some time after the command group function object returns and before the associated call to queue::submit returns.
>
> Edit: Sorry for being vague. The command_submit time must be recorded before the submit function returns, whereas in the host_accessor case it would be recorded after the submit method returns with the L0 approach.

So this essentially requires recording the time of the queue::submit() call and not the time of the "command" submit to the execution queue. This sounds odd to me, especially since the timestamp is a device time and the device would not even be involved at that point.

I think this is related to another ongoing discussion with @gmlueck, where we admit that SYCL queue::submit() doesn't guarantee that commands are physically submitted, so what is the reason to require that the command_submit timestamp is recorded before the submit function returns?

@gmlueck: can you add your perspective, please?

gmlueck (Contributor) commented Dec 2, 2022

It seems natural for the user to want to know the delay between when a command is submitted and when that command starts executing on hardware. The command_start timestamp tells when it starts executing, so we need command_submit to allow applications to compute the delay. Both timestamps must have the same "timebase", so that it is valid for applications to compute the difference.

We had discussions on how this could be implemented when we clarified this part of the spec. Here are the relevant notes:

> Adding some notes here about how these profiling timestamps could be implemented on OpenCL. As noted above, info::event_profiling::command_start corresponds to CL_PROFILING_COMMAND_START and info::event_profiling::command_end corresponds to CL_PROFILING_COMMAND_END.
>
> Implementing info::event_profiling::command_submit requires some way to get a timestamp from host code, but using the device clock. (More precisely, the timestamp must have the same "timebase" as the timestamps used for info::event_profiling::command_start and info::event_profiling::command_end.) On OpenCL, this can be achieved by using clGetDeviceAndHostTimer and clGetHostTimer.
>
> The implementation needs to call clGetDeviceAndHostTimer just once for each device. This returns a synchronized pair of timestamps: one from the host clock and one from the device clock. Subtracting the two gives the delta between the two clocks. Then, the implementation needs to call clGetHostTimer each time a command is put on the SYCL queue. This retrieves the timestamp using the "host clock". The implementation can then add the delta (computed earlier) to convert this timestamp to the "device clock", and this is the timestamp to return for info::event_profiling::command_submit.
>
> Note that these two APIs (clGetDeviceAndHostTimer and clGetHostTimer) were added in OpenCL 2.1, so it seems likely that this part of the SYCL API could not be implemented on a pure OpenCL 1.2 implementation. This does not seem problematic, though, especially since the SYCL profiling APIs are optional for each device.

When I wrote these notes, I recall that the same strategy would work for Level Zero, so I think there are similar Level Zero APIs to the OpenCL ones I list above.
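A compact sketch of the algorithm from these notes, in terms of the OpenCL 2.1 host APIs; error handling is omitted and the helper names are illustrative:

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

// Call once per device: returns (device clock - host clock) from a synchronized
// timestamp pair, so later host timestamps can be shifted into the device timebase.
cl_long computeClockDelta(cl_device_id Device) {
  cl_ulong DeviceTime = 0, HostTime = 0;
  clGetDeviceAndHostTimer(Device, &DeviceTime, &HostTime);
  return static_cast<cl_long>(DeviceTime) - static_cast<cl_long>(HostTime);
}

// Call on every queue::submit(): a cheap host-clock read, converted into the
// device timebase so it can be compared with command_start / command_end.
cl_ulong commandSubmitTimestamp(cl_device_id Device, cl_long ClockDelta) {
  cl_ulong HostTime = 0;
  clGetHostTimer(Device, &HostTime);
  return HostTime + ClockDelta;
}
```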

smaslov-intel (Contributor) commented:

> It seems natural for the user to want to know the delay between when a command is submitted and when that command starts executing on hardware.

Agreed. The remaining question is when to capture the "submit" time. Is it OK to do it at the time when piEnqueue* is called? (I'd like to avoid a PI API extension just for capturing the device's time.)

gmlueck (Contributor) commented Dec 2, 2022

> The remaining question is when to capture the "submit" time. Is it OK to do it at the time when piEnqueue* is called? (I'd like to avoid a PI API extension just for capturing the device's time.)

The SYCL spec says:

> This is always some time after the command group function object returns and before the associated call to queue::submit returns.

Does that correspond to when piEnqueue* is called?

raaiq1 (Contributor Author) commented Dec 2, 2022

> Does that correspond to when piEnqueue* is called?

There is an edge case where that's not necessarily true. If a host accessor is constructed before submitting a kernel (through queue::submit()) with the same memory dependency, then the piEnqueue call for that kernel is deferred until the host accessor is destroyed.

gmlueck (Contributor) commented Dec 2, 2022

> There is an edge case where that's not necessarily true. If a host accessor is constructed before submitting a kernel (through queue::submit()) with the same memory dependency, then the piEnqueue call for that kernel is deferred until the host accessor is destroyed.

I think it would make sense to capture the timestamp at the same point that the spec says even in this case. Is it hard to add a new PI API to capture the timestamp?

smaslov-intel (Contributor) commented:

> There is an edge case where that's not necessarily true. If a host accessor is constructed before submitting a kernel (through queue::submit()) with the same memory dependency, then the piEnqueue call for that kernel is deferred until the host accessor is destroyed.
>
> I think it would make sense to capture the timestamp at the same point that the spec says even in this case. Is it hard to add a new PI API to capture the timestamp?

Technically, it is not hard, but it should be coordinated/approved for the Unified Runtime API. It would probably just be a new info for the existing piDeviceGetInfo. Then again, isn't it more useful to end users to record the time when the kernel was actually submitted by the host (when piEnqueue* is called)?

gmlueck (Contributor) commented Dec 2, 2022

It seems to me that the dependency on the host_accessor is no different than a dependency on a buffer through a regular accessor. In both cases, the kernel cannot be executed until the memory is available. Why should we treat the host_accessor case differently from the regular accessor case?

smaslov-intel (Contributor) commented:

> There is an edge case where that's not necessarily true. If a host accessor is constructed before submitting a kernel (through queue::submit()) with the same memory dependency, then the piEnqueue call for that kernel is deferred until the host accessor is destroyed.
>
> I think it would make sense to capture the timestamp at the same point that the spec says even in this case. Is it hard to add a new PI API to capture the timestamp?
>
> Technically, it is not hard, but it should be coordinated/approved for the Unified Runtime API. It would probably just be a new info for the existing piDeviceGetInfo. Then again, isn't it more useful to end users to record the time when the kernel was actually submitted by the host (when piEnqueue* is called)?

@raaiq1: At this time I am not opposed to the PI API extension, but we will also need to get it into Unified Runtime; tagging @kbenzie.

kbenzie (Contributor) commented Dec 6, 2022

This looks a lot like clGetDeviceAndHostTimer. Should we use an analogue of this instead?

raaiq1 (Contributor Author) commented Dec 6, 2022

I'm okay with using an analogue of clGetDeviceAndHostTimer; maybe have the PI API analogue be piGetDeviceAndHostTime.
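For reference, a hypothetical sketch of a PI analogue mirroring the OpenCL signature (the exact name, piGetDeviceAndHostTime vs. piGetDeviceAndHostTimer, is settled further down this thread), as it might appear in pi.h:

```cpp
// Hypothetical PI declaration mirroring clGetDeviceAndHostTimer: returns a
// synchronized pair of device and host timestamps for the given device.
pi_result piGetDeviceAndHostTimer(pi_device Device,
                                  uint64_t *DeviceTime,
                                  uint64_t *HostTime);
```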

kbenzie (Contributor) commented Dec 6, 2022

> I'm okay with using an analogue of clGetDeviceAndHostTimer; maybe have the PI API analogue be piGetDeviceAndHostTime.

Apologies, I should have expanded. This was a question about using an analogue of clGetDeviceAndHostTimer in Unified Runtime. Sounds like that would be acceptable; if so, I'll create a ticket to get that added to the UR spec.

Ideally, we would have the PI and UR entry points match, but I don't think it's a strict requirement.

smaslov-intel (Contributor) commented:

> Ideally, we would have the PI and UR entry points match, but I don't think it's a strict requirement.

Please have them match, to avoid redundant friction in the SYCL RT from having to deal with both paths for some (long) time.

kbenzie (Contributor) commented Dec 6, 2022

@smaslov-intel that works for me 👍

In that case, yes, piGetDeviceAndHostTimer being the analogue would be my preference, to better align with OpenCL where we can.

gmlueck (Contributor) commented Dec 6, 2022

OpenCL has two different functions clGetDeviceAndHostTimer and clGetHostTimer. Shouldn't we add PI / UR interfaces for each?

kbenzie (Contributor) commented Dec 6, 2022

Perhaps, although I'm only aware of a direct use case for the first at this time.

If you decide to also add a clGetHostTimer analogue, please update oneapi-src/unified-runtime#88 with the request 🙂

gmlueck (Contributor) commented Dec 6, 2022

> Perhaps, although I'm only aware of a direct use case for the first at this time.

The algorithm I propose in the comment above would use them both. Is the code using that algorithm?

raaiq1 (Contributor Author) commented Dec 6, 2022

> The algorithm I propose in #7526 (comment) would use them both. Is the code using that algorithm?

The algorithm is a little more straightforward: submitting a task to the SYCL queue just calls clGetDeviceAndHostTimer directly each time. I couldn't observe much of a performance difference compared to the suggested method, but I'm open to using it.
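For comparison, a sketch of this more direct approach (a synchronized device/host read on every submission rather than a cached delta), again assuming the OpenCL 2.1 APIs; the helper name is illustrative:

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

// One synchronized device/host query per submission; the returned value is
// already in the device timebase, at the cost of a potentially higher-latency
// clGetDeviceAndHostTimer call each time.
cl_ulong commandSubmitTimestampDirect(cl_device_id Device) {
  cl_ulong DeviceTime = 0, HostTime = 0;
  clGetDeviceAndHostTimer(Device, &DeviceTime, &HostTime);
  return DeviceTime;
}
```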

gmlueck (Contributor) commented Dec 6, 2022

> I couldn't observe much of a performance difference compared to the suggested method (#7526 (comment)), but I'm open to using it.

FWIW, OpenCL documents that clGetDeviceAndHostTimer may have high latency, which was why I suggested calling it only once:

> Implementations may need to execute this query with a high latency in order to provide reasonable synchronization of the timestamps.

I'm not sure whether our implementation really does have higher latency, though.

raaiq1 changed the title from "[SYCL] Implement command_submit L0" to "[SYCL] Fix event profiling for command_submit in L0 and other backends" on Dec 22, 2022.
@@ -5988,7 +5988,6 @@ pi_result piEventGetProfilingInfo(pi_event Event, pi_profiling_info ParamName,
}
case PI_PROFILING_INFO_COMMAND_QUEUED:
case PI_PROFILING_INFO_COMMAND_SUBMIT:
// TODO: Support these when Level Zero supported is added.
return ReturnValue(uint64_t{0});
Review comment on the diff above (Contributor):

Returning "0" is still not the right behavior. We should use zeDeviceGetGlobalTimestamps to record the time when commands are physically submitted to the device (inside the plugin). I am OK if you just add a TODO comment and don't fix it in this PR, since that value will still be unused (by the way, also mention that there are currently no users of this).
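A sketch of what recording that inside the Level Zero plugin might look like with zeDeviceGetGlobalTimestamps; this is not necessarily how the plugin ends up implementing it, and error handling plus the tick-to-nanosecond conversion are omitted:

```cpp
#include <level_zero/ze_api.h>

// Sketch: capture a synchronized host/device timestamp pair at the point where a
// command is physically submitted to the device. The raw device value still has
// to be masked by timestampValidBits and scaled by timerResolution from
// ze_device_properties_t before it can be compared in nanoseconds; that
// conversion (and error checking) is omitted here.
void recordSubmitTimestamps(ze_device_handle_t Device,
                            uint64_t &HostTimestamp,
                            uint64_t &DeviceTimestamp) {
  zeDeviceGetGlobalTimestamps(Device, &HostTimestamp, &DeviceTimestamp);
}
```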

mdtoguchi removed the request for review from a team on December 22, 2022 19:20.
raaiq1 (Contributor Author) commented Dec 22, 2022

ping @intel/llvm-gatekeepers to merge

raaiq1 (Contributor Author) commented Dec 22, 2022

ping @intel/llvm-reviewers-runtime, @intel/llvm-reviewers-cuda and @intel/dpcpp-esimd-reviewers for review

AlexeySachkov removed the request for review from a team on December 23, 2022 09:11.
smaslov-intel (Contributor) commented:

@againull: please review/merge this

againull self-requested a review on January 3, 2023 19:08.
bader (Contributor) commented Jan 5, 2023

ping @intel/llvm-reviewers-cuda and @intel/dpcpp-esimd-reviewers for review

againull (Contributor) commented:

ESIMD changes are trivial, so merging this PR.

againull merged commit 71d7797 into intel:sycl on Jan 10, 2023.
npmiller added a commit to npmiller/llvm that referenced this pull request Feb 6, 2023
This patch moves the CUDA context from the PI context to the PI device,
and switches to always using the primary context.

CUDA contexts are different from SYCL contexts in that they're tied to a
single device, and that they are required to be active on a thread for
most calls to the CUDA driver API.

As shown in intel#8124 and intel#7526, the current mapping of
CUDA context to PI context causes issues for device-based entry points
that still need to call the CUDA APIs. We have workarounds to solve
that, but they're a bit hacky, inefficient, and have a lot of edge-case
issues.

The peer-to-peer interface proposal in intel#6104 is also device-based,
but enabling peer-to-peer for CUDA is done on the CUDA contexts,
so the current mapping would make it difficult to implement.

So this patch solves most of these issues by decoupling the CUDA context
from the SYCL context and simply managing the CUDA contexts in the
devices. It also changes the CUDA context management to always use the
primary context.

This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device-based entry points
* Likely more efficient in the general case, with fewer opportunities to
  accidentally cause costly CUDA context switches.
* Easier and likely more efficient interactions with CUDA runtime
  applications.
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context

It does have a few drawbacks compared to the previous approach:

* Drops support for `make_context` interop; there is no sensible "native
  handle" to pass in (`get_native` is still supported).
* No opportunity for users to separate their work into different CUDA
  contexts. It's unclear if there's any actual use case for this; it
  seems very uncommon in CUDA codebases to have multiple CUDA contexts
  for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we
could revisit if we run into an edge case that would need more
fine-grained CUDA context management.
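For illustration, a minimal sketch of the primary-context pattern this commit message describes, using the CUDA driver API; error handling and the actual PI plumbing are omitted, and the helper names are made up:

```cpp
#include <cuda.h>

// Retain the device's primary context once per device and make it current on
// the calling thread before driver API calls; the primary context is the one
// shared with the CUDA runtime API, which is what eases interop with runtime apps.
CUcontext retainPrimaryContext(CUdevice Device) {
  CUcontext Context = nullptr;
  cuDevicePrimaryCtxRetain(&Context, Device);
  cuCtxSetCurrent(Context);
  return Context;
}

// Matching release when the device is torn down.
void releasePrimaryContext(CUdevice Device) {
  cuDevicePrimaryCtxRelease(Device);
}
```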
bader pushed a commit that referenced this pull request Feb 9, 2023