[SYCL] Fix event profiling for command_submit in L0 and other backends #7526
Conversation
Signed-off-by: Rauf, Rana <rana.rauf@intel.com>
@smaslov-intel For querying the device time at the plugin layer, is it justified to have a separate method:
Please explain the entire design you have in mind so I can make a call. Why is the new query needed, and how will it be used for obtaining the "command_submit" time from the Level Zero backend? What about other backends?
So the main idea is to query the device timestamp right after a command group is submitted to the scheduler or to L0, and use that as the submission time; currently there's no existing function in the PI API to get that timestamp. In detail, I would like to add a `getDeviceCurrentTime` function to
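For illustration, a minimal sketch of what such a PI entry point could look like (the name and signature below are hypothetical and assume the PI handle types from pi.h; the actual interface was still under discussion at this point):

```cpp
// Hypothetical sketch only, not the interface that was ultimately adopted.
// Returns the device's current timestamp so the SYCL runtime can record it
// when the command group is handed to the scheduler, and later report it as
// info::event_profiling::command_submit.
pi_result piGetDeviceCurrentTime(pi_device Device,      // device whose clock is queried
                                 uint64_t *DeviceTime); // out: device timestamp in ns
```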
Why do you want to expose the "current" device timestamp to SYCL (via PI)? Why not keep it internal to the L0 Plugin, and record it whenever needed (either when a command is put into a batch, or when that batch is submitted)?
In the edge case Vlad Romanov presented:
The host accessor would basically block enqueuing the
What exactly in the spec is violated?
Under Table 37 of SYCL 2020, this is the description for
Edit: Sorry for being vague. So
So this essentially requires recording the time of the queue::submit() and not the time of the "command" submit to the execution queue. This sounds odd to me, especially since the timestamp is a device's time, and the device wouldn't even be involved at that point. I think this is related to another ongoing discussion with @gmlueck that we admit that SYCL @gmlueck: can you add your perspective please?
It seems natural for the user to want to know the delay between when a command is submitted and when that command starts executing on hardware. We had discussions on how this could be implemented when we clarified this part of the spec. Here are the relevant notes:
When I wrote these notes, I recall that the same strategy would work for Level Zero, so I think there are similar Level Zero APIs to the OpenCL ones I list above.
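As a rough illustration of the kind of correlation those notes describe (a sketch only, assuming the host timestamp is captured at submission time and later mapped onto the device timebase; error handling omitted):

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <cstdint>

// Capture a host timestamp when the command group is submitted.
uint64_t recordSubmitHostTime(cl_device_id Dev) {
  cl_ulong HostTime = 0;
  clGetHostTimer(Dev, &HostTime); // OpenCL 2.1+: host clock only, in ns
  return HostTime;
}

// Later, when profiling info is queried, translate that host timestamp into
// the device timebase using a synchronized (device, host) sample.
uint64_t submitTimeOnDeviceClock(cl_device_id Dev, uint64_t SubmitHostTime) {
  cl_ulong DeviceNow = 0, HostNow = 0;
  clGetDeviceAndHostTimer(Dev, &DeviceNow, &HostNow); // synchronized pair, in ns
  // Shift the device sample back by the host-time delta between "now" and
  // the recorded submission point (assumes both clocks tick at the same rate).
  return DeviceNow - (HostNow - SubmitHostTime);
}
```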
Agreed. The remaining question is when to capture the "submit" time. Is it OK to do it at the time when
The SYCL spec says:
Does that correspond to when piEnqueue* is called?
There is an edge case where that's not necessarily true. If a host accessor is constructed before submitting a kernel (through
I think it would make sense to capture the timestamp at the same point that the spec says even in this case. Is it hard to add a new PI API to capture the timestamp?
Technically, it is not hard, but it should be coordinated/approved for the Unified Runtime API. It would probably be just a new info for the existing piDeviceGetInfo. Just again, isn't it more useful to the end users to note the time when the kernel was actually submitted by the host (
It seems to me that the dependency on the host_accessor is no different than a dependency on a buffer through a regular accessor. In both cases, the kernel cannot be executed until the memory is available. Why should we treat the host_accessor case differently from the regular accessor case?
@raaiq1: At this time I am not opposed to the PI API extension, but we will also need to get it into Unified Runtime. Tagging @kbenzie.
This looks a lot like clGetDeviceAndHostTimer. Should we use an analogue of this instead?
I'm okay with using an analogue of clGetDeviceAndHostTimer, maybe have the PI API analogue be
Apologies, I should have expanded. This was a question about using an analogue of clGetDeviceAndHostTimer in Unified Runtime. Sounds like that would be acceptable; if so, I'll create a ticket to get that added to the UR spec. Ideally we would have the PI and UR entry points match, but I don't think it's a strict requirement.
Please have them match to avoid redundant friction in the SYCL RT having to deal with both paths for some (long) time.
@smaslov-intel that works for me 👍 In that case, yes
OpenCL has two different functions clGetDeviceAndHostTimer and clGetHostTimer. Shouldn't we add PI / UR interfaces for each?
Perhaps, although I'm only aware of a direct use case for the first at this time. If you decide to also add a
The algorithm I propose in the comment above would use them both. Is the code using that algorithm?
The algorithm is a little bit more straightforward, where submitting a task to the SYCL queue just calls
FWIW, OpenCL documents that
I'm not sure whether our implementation really does have higher latency, though. |
@@ -5988,7 +5988,6 @@ pi_result piEventGetProfilingInfo(pi_event Event, pi_profiling_info ParamName,
}
case PI_PROFILING_INFO_COMMAND_QUEUED:
case PI_PROFILING_INFO_COMMAND_SUBMIT:
// TODO: Support these when Level Zero supported is added.
return ReturnValue(uint64_t{0});
Returning "0" is still not the right behavior. We should use zeDeviceGetGlobalTimestamps to record the time when commands were physically submitted to the device (inside the plugin). I am OK if you just add a TODO comment and not fix it in this PR, since that will still be unused (btw, also mention that there are currently no users of this).
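For reference, a minimal sketch of capturing that timestamp with the Level Zero API mentioned above (illustrative only; error handling and the conversion from raw device ticks to nanoseconds via timerResolution / kernelTimestampValidBits in ze_device_properties_t are omitted):

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>

// Illustrative sketch: capture the device timestamp at the point where a
// command is physically submitted to the device, so the plugin can later
// report it for PI_PROFILING_INFO_COMMAND_SUBMIT.
uint64_t captureSubmitTimestamp(ze_device_handle_t Device) {
  uint64_t HostTimestamp = 0, DeviceTimestamp = 0;
  zeDeviceGetGlobalTimestamps(Device, &HostTimestamp, &DeviceTimestamp);
  return DeviceTimestamp; // would be stored on the PI event by the plugin
}
```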
ping to @intel/llvm-gatekeepers to merge
ping @intel/llvm-reviewers-runtime, @intel/llvm-reviewers-cuda and @intel/dpcpp-esimd-reviewers for review
@againull: please review/merge this
ping @intel/llvm-reviewers-cuda and @intel/dpcpp-esimd-reviewers for review
ESIMD changes are trivial, so merging this PR. |
This patch moves the CUDA context from the PI context to the PI device, and switches to always using the primary context. CUDA contexts are different from SYCL contexts in that they're tied to a single device, and that they are required to be active on a thread for most calls to the CUDA driver API.

As shown in #8124 and #7526, the current mapping of CUDA context to PI context causes issues for device-based entry points that still need to call the CUDA APIs. We have workarounds to solve that, but they're a bit hacky, inefficient, and have a lot of edge-case issues. The peer-to-peer interface proposal in #6104 is also device-based, but enabling peer-to-peer for CUDA is done on the CUDA contexts, so the current mapping would make it difficult to implement.

So this patch solves most of these issues by decoupling the CUDA context from the SYCL context and simply managing the CUDA contexts in the devices; it also changes the CUDA context management to always use the primary context. This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device-based entry points
* Likely more efficient in the general case, with fewer opportunities to accidentally cause costly CUDA context switches
* Easier and likely more efficient interactions with CUDA runtime applications
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context

It does have a few drawbacks compared to the previous approach:

* Drops support for `make_context` interop, as there is no sensible "native handle" to pass in (`get_native` is still supported fine).
* No opportunity for users to separate their work into different CUDA contexts. It's unclear if there's any actual use case for this; it seems very uncommon in CUDA codebases to have multiple CUDA contexts for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we could revisit if we run into an edge case that would need more fine-grained CUDA context management.
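As background, the primary-context pattern described above looks roughly like this when expressed directly against the CUDA driver API (a simplified, standalone sketch, not the plugin's actual code):

```cpp
#include <cuda.h>

// Simplified sketch of the per-device primary context pattern: retain the
// device's primary context once, make it current on whichever thread needs
// to call the driver API, and release it when the device is torn down.
int main() {
  cuInit(0);

  CUdevice Dev;
  cuDeviceGet(&Dev, 0);

  CUcontext PrimaryCtx;
  cuDevicePrimaryCtxRetain(&PrimaryCtx, Dev); // shared with the CUDA runtime

  cuCtxSetCurrent(PrimaryCtx); // activate on this thread before driver calls

  // ... driver API work tied to this device would go here ...

  cuDevicePrimaryCtxRelease(Dev); // balance the retain
  return 0;
}
```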
According to the SYCL 2020 specification, the timeframe in which to calculate command submission time is:
Currently, command submission time is calculated when a command is submitted to the underlying device, which may not necessarily be before `queue::submit` returns, e.g. a `host_accessor` blocking command submission until it's destroyed. This patch changes that timeframe to always be before `queue::submit` returns, specifically right after the command is persisted by the graph builder and before it is enqueued by the graph processor.
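For illustration, a small hypothetical example of the `host_accessor` edge case described above (buffer size and kernel are placeholders; it assumes a backend that supports these profiling queries):

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q{sycl::property::queue::enable_profiling{}};
  constexpr size_t N = 1024;
  sycl::buffer<int> buf{sycl::range{N}};

  sycl::event e;
  {
    // While this host_accessor is alive, the kernel below cannot be enqueued
    // to the device, but queue::submit() still returns.
    sycl::host_accessor ha{buf};
    e = q.submit([&](sycl::handler &cgh) {
      sycl::accessor acc{buf, cgh};
      cgh.parallel_for(sycl::range{N}, [=](sycl::id<1> i) { acc[i] = 1; });
    });
    // With this patch, command_submit is captured before submit() returns,
    // i.e. while the host_accessor still blocks the actual enqueue.
  } // host_accessor destroyed here; the command can now reach the device.

  e.wait();
  auto submit = e.get_profiling_info<sycl::info::event_profiling::command_submit>();
  auto start  = e.get_profiling_info<sycl::info::event_profiling::command_start>();
  std::cout << "submit-to-start delay (ns): " << (start - submit) << "\n";
}
```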