
[SYCL][DOC] Initial commit of oneapi extension proposal for adding P2P #6104

Merged: 14 commits merged into intel:sycl on Mar 3, 2023

Conversation

@jbrodman (Contributor) commented May 5, 2022

…mechanisms to SYCL

Signed-off-by: James Brodman <james.brodman@intel.com>
@jbrodman jbrodman requested a review from a team as a code owner May 5, 2022 19:26
@JackAKirk (Contributor) commented May 9, 2022

Thanks for posting this. I have a few questions:

  1. Just to be completely clear: Is the implication that if ext_oneapi_can_access_peer(device_b) returns true when called from device_a then users can have kernels running on device_a that can access a USM memory pointer, "ptr_b", allocated on device_b and this will lead to a "peer access" (rather than a "peer copy") from device_a to the memory located on device_b?

  2. Do you have plans to support peer memory copies (as opposed to "peer accesses") from "ptr_b" to "ptr_a" on device_a within this proposal at a later date?

With regard to this, I read through the proposed clarification to the peer access rules in SYCL 2020 next here: gmlueck/SYCL-Docs@76e1b44; it seems that references to "migratable USM" have been removed. I didn't read it thoroughly, but I didn't find an explicit mention of peer-to-peer USM copies (as opposed to peer-to-peer access). cc @gmlueck: is this intentional?

  3. Do you have plans to support P2P for buffers?
    With regard to buffers, currently if one instantiates a buffer and then accesses it in a queue using device_a, followed by an otherQueue using device_b, the scheduler copies the memory from device_a to the host and then from the host to device_b, e.g. when using:
    myQueue.submit([&](handler &cgh) {
      auto read = buffer_from_1D.get_access<access::mode::read>(cgh);
      auto write = buffer_to_1D.get_access<access::mode::write>(cgh);
      cgh.parallel_for<class copyH2D_1D>(
          buffer_from_1D.get_range(),
          [=](id<1> index) { write[index] = read[index] * -1; });
    });
    myQueue.wait();

    otherQueue.submit([&](handler &cgh) {
      auto read = buffer_from_1D.get_access<access::mode::read>(cgh);
      auto write = buffer_to_1D.get_access<access::mode::write>(cgh);
      cgh.parallel_for<class copyH2D_1D_2nd>(
          buffer_from_1D.get_range(),
          [=](id<1> index) { write[index] = read[index] * 10; });
    });

The simplest way to leverage Peer memory in this case is to allow a direct memory copy from device_a to device_b (allowing it only when the devices share a context). The implementation could be very similar to this scrapped implementation for a similar peer to peer copy for devices in different contexts (now deemed not allowed): #4401

  4. Do you think that peer access should be supported/is sensible for buffers also? If so, do you have any thoughts on how this would be supported within the SYCL programming model?

Thanks

@jbrodman (Contributor, Author)

  1. Just to be completely clear: Is the implication that if ext_oneapi_can_access_peer(device_b) returns true when called from device_a then users can have kernels running on device_a that can access a USM memory pointer, "ptr_b", allocated on device_b and this will lead to a "peer access" (rather than a "peer copy") from device_a to the memory located on device_b?
    Yes - a USM device pointer allocated on device_b will now be dereferenceable in kernels on device_a.
  2. Do you have plans to support peer memory copies (as opposed to "peer accesses") from "ptr_b" to "ptr_a" on device_a within this proposal at a later date?
    That seems like an implementation detail. There's nothing that says the SYCL RT can't do this today.
  3. Do you have plans to support P2P for buffers?
    With regard to buffers, currently if one instantiates a buffer and then accesses it in a queue using device_a, followed by an otherQueue using device_b, the scheduler copies the memory from device_a to the host and then from the host to device_b.
    Not at this time.
  4. Do you think that peer access should be supported/is sensible for buffers also? If so, do you have any thoughts on how this would be supported within the SYCL programming model?
    No. P2P-type capabilities better align with USM device memory. As per my answer to 2, I think an RT could do optimized peer copies using the existing copy methods.

Signed-off-by: James Brodman <james.brodman@intel.com>
Signed-off-by: James Brodman <james.brodman@intel.com>
it indicates that this device may perform atomic operations on USM device memory
allocations located on the `peer` device when peer access is enabled to that
device. If the query returns false, attempting to perform atomic operations on
`peer` memory will have undefined behavior.
Contributor

The core SYCL spec makes a distinction between "atomic operations" and "concurrent access". The Level Zero driver has separate queries for these two concepts. We need to clarify what atomics_supported means. I think it should mean that both atomic operations and concurrent access are supported, which is consistent with the current wording in the SYCL spec for the usm_atomic_shared_allocations aspect.

This is an area we are debating in general, though, so we may end up making two different queries for these concepts.

Contributor Author

I don't think concurrent access comes into play here - I think it's only (pseudocode) atomicAdd(ptr, val)

Contributor

Atomic operations only make sense if two things can access the memory concurrently. I guess there are two possible interpretations for what atomics_supported means:

  1. This device and peer device can concurrently access the device USM and do atomic operations on that memory. These operations are atomic w.r.t. code running on the two devices.

  2. This device can access device USM from peer, but it cannot access it concurrently with peer. Atomic operations are supported, but only between work-items running on this device.

I was originally thinking the query meant (1), but your comment makes me think that maybe you intend (2)?

@Pennycook (Contributor), Sep 9, 2022

Another thing that we should pay attention to here is the concept of memory scope.

If the device and peer can use atomics to concurrently update the same memory, then both devices will need to list memory_scope::system in info::device::atomic_memory_scope_capabilities. Both devices will need to use atomics with memory_scope::system when concurrently accessing the memory to avoid a data race.

If the device is only accessing peer's memory atomically but not concurrently with peer, it can use atomics with memory_scope::device. If peer accesses the same memory concurrently, that's a data race.

I don't know whether it's better to use the atomics & concurrent distinction or to work in some concept of scope, but I agree with Greg that this needs to clarify exactly what is guaranteed.
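For concreteness, here is a rough sketch of the two interpretations in SYCL 2020 terms (my own illustration, not text from the extension; it assumes q is a queue on this device and counter points to device USM allocated on peer that is accessible from this device):

// Interpretation (1): this device and the peer update the same location
// concurrently, so both sides must use system-scope atomics (and both devices
// must report memory_scope::system in atomic_memory_scope_capabilities).
q.parallel_for(sycl::range<1>{1}, [=](sycl::id<1>) {
  sycl::atomic_ref<int, sycl::memory_order::relaxed,
                   sycl::memory_scope::system,
                   sycl::access::address_space::global_space>
      ref(*counter);
  ref.fetch_add(1);
});

// Interpretation (2): only this device touches the peer allocation, so
// device-scope atomics suffice; concurrent access from the peer would be a
// data race.
q.parallel_for(sycl::range<1>{1}, [=](sycl::id<1>) {
  sycl::atomic_ref<int, sycl::memory_order::relaxed,
                   sycl::memory_scope::device,
                   sycl::access::address_space::global_space>
      ref(*counter);
  ref.fetch_add(1);
});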

Contributor

Would it help to add a new extended memory scope like memory_scope::ext_oneapi_peer_devices?

Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
@JackAKirk (Contributor)

I think this looks quite good from the point of view of the CUDA backend (apart from the one issue I describe below). I can try a simple implementation to make sure there are no other issues with CUDA implementing this.

There is one point that I'd like to clarify: Currently some backends (e.g. level_zero is implemented already) can do direct P2P copies for buffers. I think we should consider whether it is required (or not) for users to call ext_oneapi_enable_peer_access first before this buffer P2P opt is enabled.

In the CUDA backend, in order to enable P2P copy (as well as P2P access) of memory from one device to another, it would be necessary to call cuCtxEnablePeerAccess: see https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PEER__ACCESS.html#group__CUDA__PEER__ACCESS_1g0889ec6728e61c05ed359551d67b3f5a

This means that if we want to disentangle the buffer P2P optimization from the USM P2P access feature, then whenever the runtime does a P2P buffer copy, the CUDA backend implementation of the buffer P2P copy optimization would need to do the following (a rough sketch follows the list):

ext_oneapi_enable_peer_access (calls cuCtxEnablePeerAccess in CUDA backend)
Perform P2P copy
ext_oneapi_disable_peer_access (calls cuCtxDisablePeerAccess in CUDA backend)
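For illustration only, a rough sketch (my own, not the actual plugin code) of how that sequence would map onto the CUDA driver API, assuming ctx_a is the CUcontext of the accessing device and ctx_b the CUcontext of the peer that owns src:

#include <cuda.h>

CUresult buffer_p2p_copy_once(CUcontext ctx_a, CUcontext ctx_b,
                              CUdeviceptr dst, CUdeviceptr src,
                              size_t bytes, CUstream stream) {
  // ext_oneapi_enable_peer_access would boil down to this call
  // (the flags argument must be 0).
  CUresult res = cuCtxSetCurrent(ctx_a);
  if (res != CUDA_SUCCESS)
    return res;
  res = cuCtxEnablePeerAccess(ctx_b, 0);
  if (res != CUDA_SUCCESS && res != CUDA_ERROR_PEER_ACCESS_ALREADY_ENABLED)
    return res;

  // Direct P2P copy: dst lives in ctx_a, src in ctx_b.
  res = cuMemcpyPeerAsync(dst, ctx_a, src, ctx_b, bytes, stream);
  if (res != CUDA_SUCCESS)
    return res;

  // Wait for the copy before tearing the connection back down.
  res = cuStreamSynchronize(stream);
  if (res != CUDA_SUCCESS)
    return res;

  // ext_oneapi_disable_peer_access would boil down to this call.
  return cuCtxDisablePeerAccess(ctx_b);
}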

I think this would mean that, in order to ensure that the (buffer-)enabled peer-to-peer access doesn't interfere with a user's usage of this USM extension (via the expected result of ext_oneapi_can_access_peer), we would have to make the buffer copy block the host thread, to prevent it from calling ext_oneapi_can_access_peer while the buffer copy is in flight (during which it would temporarily return true). Alternatively, we could omit the ext_oneapi_disable_peer_access call at the end of the buffer copy, in which case ext_oneapi_can_access_peer(peer, ext::oneapi::peer_access::access_enabled) could return true without the user explicitly calling ext_oneapi_enable_peer_access.

Perhaps this is only an issue for the CUDA backend, but it already sounds pretty messy, and I think it could be a good idea to avoid these issues by connecting this USM extension with the expected behaviour of any buffer P2P copy optimization, such that we could state:

The user is required to call ext_oneapi_enable_peer_access to switch on any available buffer direct P2P copy optimization between different devices.

What do you think @jbrodman @gmlueck? Would this connection between USM peer access and buffer peer copy be undesirable for the level_zero backend?

This extension adds support for mechanisms to query and enable support for
direct memory access between peer devices in a system.
In particular, this allows one device to directly access USM Device
allocations for a peer device in the same context.
Contributor

If two devices with P2P capabilities are placed in the same context, shouldn't this be implicitly enabled?

Contributor

There has been a lot of discussion about what a context means. I think our current consensus is that it does not provide any guarantee about P2P access between devices. Therefore, placing two devices in the same context does not provide any guarantee that USM memory allocated for one of those devices is accessible from another device in that same context.

See the discussion in internal Khronos issue 563.

@gmlueck (Contributor) commented Oct 24, 2022

What do you think @jbrodman @gmlueck? Would this connection between USM peer access and buffer peer copy be undesirable for the level_zero backend?

Two comments:

  • Would the buffer-copy optimization really call cuCtxEnablePeerAccess/ cuCtxDisablePeerAccess for each copy operation? Somehow I thought these operations could be potentially slow, so enabling / disabling P2P on each copy could have disastrous performance. (I have no first-hand experience, so I might be wrong about the enable / disable APIs being slow.)

  • Your comments about the interaction between buffer-copy optimization and the P2P API assume that APIs in this extension are implemented by directly calling CUDA APIs. Instead, the extension could maintain some internal state that virtualizes the enable / disable state. This would allow ext_oneapi_can_access_peer(peer, ext::oneapi::peer_access::access_enabled) to return false even if P2P access was currently enabled due to the buffer-copy optimization.
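For example, the runtime could keep something like the following internal state (purely a sketch under that assumption; the names and structure are illustrative, not the DPC++ runtime):

#include <mutex>
#include <set>
#include <utility>

// Tracks which peer connections the *user* enabled through the extension API,
// independently of whether the driver-level connection happens to be enabled
// (e.g. temporarily, for a buffer-copy optimization).
class PeerAccessState {
  std::mutex Mtx;
  std::set<std::pair<int, int>> UserEnabled; // (device id, peer id)

public:
  void userEnable(int Dev, int Peer) {
    std::lock_guard<std::mutex> Lock(Mtx);
    UserEnabled.insert({Dev, Peer});
    // ...also enable the driver-level connection here if it is not already on.
  }
  void userDisable(int Dev, int Peer) {
    std::lock_guard<std::mutex> Lock(Mtx);
    UserEnabled.erase({Dev, Peer});
    // ...only disable the driver-level connection if no internal client
    // (such as the buffer-copy path) still needs it.
  }
  // Backs ext_oneapi_can_access_peer(peer, peer_access::access_enabled):
  // reports only what the user enabled, not the transient driver state.
  bool accessEnabled(int Dev, int Peer) {
    std::lock_guard<std::mutex> Lock(Mtx);
    return UserEnabled.count({Dev, Peer}) != 0;
  }
};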

@JackAKirk (Contributor) commented Oct 25, 2022

* Your comments about the interaction between buffer-copy optimization and the P2P API assume that APIs in this extension are implemented by directly calling CUDA APIs.  Instead, the extension could maintain some internal state that virtualizes the enable / disable state.  This would allow `ext_oneapi_can_access_peer(peer, ext::oneapi::peer_access::access_enabled)` to return `false` even if P2P access was currently enabled due to the buffer-copy optimization.

Sure, we could deal with that in the CUDA backend if we can infer the location of the pointers provided to q.memcpy. I don't immediately know how we can most straightforwardly do that, because I don't see a CUDA driver API query that takes a pointer and returns whether it is device or host allocated, but I can ask whether anyone knows if this exists. (Of course we could fall back to some more complex/slower/heavier implementation to retrieve the locations (host or device) of pointers, but I don't think anyone will argue that this is desirable if avoidable.)

We would need to be able to know this so that we could move away from the current CUDA implementation of q.memcpy, which uses cuMemcpyAsync and infers the type of copy (D2H, H2D, D2D) from the pointer addresses. If we managed to do this, we could feasibly (in the D2D case) use the information provided by ext_oneapi_can_access_peer: if it returns false for the access query, call cuMemcpyDtoHAsync followed by cuMemcpyHtoDAsync instead of cuMemcpyAsync (or cuMemcpyPeer), or cuMemcpyDtoDAsync if the D2D copy is intra-device. If cuCtxEnablePeerAccess has been called (with no error returned and the corresponding disable API not called), then any call to cuMemcpyAsync (or cuMemcpyPeer) will do a P2P copy when the src and dst are on different peer-access-enabled devices. If cuCtxEnablePeerAccess has not been called, then any call to cuMemcpyAsync (or cuMemcpyPeer) will do D2H then H2D instead. A rough sketch of this kind of dispatch follows.
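Sketch only (not the actual plugin code): the peer_access_enabled flag stands in for whatever state the runtime would keep, and the fallback uses synchronous copies so the host staging buffer can be a local vector.

#include <cuda.h>
#include <vector>

CUresult d2d_copy(CUdeviceptr dst, CUcontext dst_ctx,
                  CUdeviceptr src, CUcontext src_ctx,
                  size_t bytes, CUstream stream, bool peer_access_enabled) {
  if (dst_ctx == src_ctx)
    return cuMemcpyDtoDAsync(dst, src, bytes, stream); // intra-device D2D

  if (peer_access_enabled)
    return cuMemcpyPeerAsync(dst, dst_ctx, src, src_ctx, bytes, stream); // P2P

  // No peer access: stage the copy through the host (D2H then H2D).
  std::vector<unsigned char> staging(bytes);
  CUresult res = cuMemcpyDtoH(staging.data(), src, bytes);
  if (res != CUDA_SUCCESS)
    return res;
  return cuMemcpyHtoD(dst, staging.data(), bytes);
}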

Note that even if we did this, it would not completely remove interactions between buffer P2P and USM P2P, although I think the remaining interactions would only matter for rather unusual use cases, for which we need to consider the limit on the number of active peer connections (active meaning cuCtxEnablePeerAccess has been called and returned without error):

For CUDA devices this maximum number of peers is set by either:

  • On non-NVSwitch enabled systems, each device can support a system-wide maximum of eight peer connections.
  • When NVSwitch is used, the maximum number of peers per device is given in a table in the following link: https://www.nvidia.com/en-gb/data-center/nvlink/ (6 for Volta, 12 for Ampere, 18 for Hopper).

So if we had a hypothetical system of 9 non-NVSwitch peers, with 8 active connections already held by a single device for buffer P2P, then any user call to ext_oneapi_enable_peer_access to enable a connection from that device to the remaining peer would have to handle the CUDA_ERROR_TOO_MANY_PEERS returned by cuCtxEnablePeerAccess gracefully, and ext_oneapi_enable_peer_access would have to report failure, until one of the peer connections used by buffer P2P is disabled.

I guess that there is a similar max peer constraint for level_zero?

* Would the buffer-copy optimization really call cuCtxEnablePeerAccess/ cuCtxDisablePeerAccess for each copy operation?  Somehow I thought these operations could be potentially slow, so enabling / disabling P2P on each copy could have disastrous performance.  (I have no first-hand experience, so I might be wrong about the enable / disable APIs being slow.)

My testing experience has been that the execution times of cuCtxEnablePeerAccess and cuCtxDisablePeerAccess are quite a lot shorter than even very small copies (the smallest I checked last year was of order 10-100 bytes). Note that I was not advocating calling cuCtxEnablePeerAccess and cuCtxDisablePeerAccess for every buffer P2P copy! I just want to put the behaviour/constraints of the CUDA peer functionality out there for you, and it is easier to do this by way of examples. There is also the question of the HIP peer functionality. At the level of the runtime, the APIs used in HIP match CUDA for the peer capabilities; however, we haven't looked into the lower-level implementation. It's probably a good first guess that the constraints of the HIP peer implementation will match CUDA, although this isn't guaranteed until we check.

For completeness, the other thing to mention for the CUDA case is that, unlike the corresponding level_zero case, peer access is granted between cuContexts rather than cuDevices: see the declaration of cuCtxEnablePeerAccess. This is an implementation detail (I'm not saying this should motivate breaking your proposed extension), but it is probably worth mentioning how it interacts with the current state of the DPC++ runtime and PI_CUDA. Basically, the fact that we can have multiple sycl::contexts mapped to a single sycl::device, and then multiple cuContexts per cuDevice, and the fact that sycl::device currently knows nothing about sycl::context, complicates things (of course there are ways we can work around this).

As an aside, this is another small issue motivating why we are interested in the context questions that still don't appear to have clear answers: does the (added complexity of the) sycl::context abstraction (and the constraints it imposes) have a real use case for:
a) DPC++ target users (SYCL spec + DPC++ extensions scope)
b) the wider set of all SYCL target users (SYCL spec scope)

For the purposes of easily testing the performance of queue.memcpy for different numbers of concurrent peer connections, I had to use a free function to switch on the access:

sycl::ext::oneapi::experimental::ext_oneapi_enable_peer_access(
    const queue &active, const queue &peer);

This works because sycl::queue knows about sycl::context. Again, I'm not suggesting the device member-function API change at all; rather, I'm providing this information to try to represent the CUDA peer functionality, and to point out how it interacts with the sycl::device, queue, and context concepts.
Hopefully once we know all the requirements of level_zero, CUDA, and more, the most natural solution will become more obvious.

Signed-off-by: James Brodman <james.brodman@intel.com>
Signed-off-by: James Brodman <james.brodman@intel.com>
npmiller added a commit to npmiller/llvm that referenced this pull request Feb 6, 2023
This patch moves the CUDA context from the PI context to the PI device,
and switches to always using the primary context.

CUDA contexts are different from SYCL contexts in that they're tied to a
single device, and that they are required to be active on a thread for
most calls to the CUDA driver API.

As shown in intel#8124 and intel#7526, the current mapping of
CUDA context to PI context causes issues for device-based entry points
that still need to call the CUDA APIs; we have workarounds to solve that,
but they're a bit hacky, inefficient, and have a lot of edge-case
issues.

The peer-to-peer interface proposal in intel#6104 is also device
based, but enabling peer-to-peer for CUDA is done on the CUDA contexts,
so the current mapping would make it difficult to implement.

So this patch solves most of these issues by decoupling the CUDA context
from the SYCL context and simply managing the CUDA contexts in the
devices; it also changes the CUDA context management to always use the
primary context.

This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device based entry points
* Likely more efficient in the general case, with less opportunities to
  accidentally cause costly CUDA context switches.
* Easier and likely more efficient interactions with CUDA runtime
  applications.
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context

It does have a few drawbacks compared to the previous approach:

* Drops support for `make_context` interop, no sensible "native handle"
  to pass in (`get_native` is still supported fine).
* No opportunity for users to separate their work into different CUDA
  contexts. It's unclear if there's any actual use case for this, it
  seems very uncommon in CUDA codebases to have multiple CUDA contexts
  for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we
could revisit if we run into an edge case that would need more fine
grained CUDA context management.
@JackAKirk (Contributor) commented Feb 8, 2023

@jbrodman @gmlueck
It looks like #8197 will be merged soon. This PR makes it much easier for the Nvidia backend to implement this extension. I've implemented this P2P proposal in the Nvidia backend on top of #8197 here: https://github.com/JackAKirk/llvm/tree/P2P-primary-ctxt.
When #8197 is merged I can open a PR for this draft implementation.

Here are some small issues I came across:

I will add some corresponding tests to exhibit all this functionality soon and link you to them. I thought I should just let you know what I found from my investigations. Apart from the fact that peer_access::access_enabled cannot be natively supported by Nvidia, the extension spec looks good for the CUDA backend.

bader pushed a commit that referenced this pull request Feb 9, 2023
@JackAKirk (Contributor) commented Feb 9, 2023

It could be good to include or link to example usage in this doc: something based on/ similar to:


// note: practically it could also be good to provide clear directions to
// documentation showing users how to make sure they are constructing queues
// using distinct devices.

auto Dev0 = Queues[0].get_device();
auto Dev1 = Queues[1].get_device();

int *arr0 = malloc<int>(N, Queues[0], usm::alloc::device);
int *arr1 = malloc<int>(N, Queues[1], usm::alloc::device);

// note: in real use would obviously load/set arr0/arr1 with meaningful data.

if (Dev0.ext_oneapi_can_access_peer(
        Dev1, sycl::ext::oneapi::peer_access::access_supported)) {
  Dev0.ext_oneapi_enable_peer_access(Dev1);
}

// Copy will be made by P2P if supported
Queues[0].copy(arr0, arr1, N).wait();

// Copy will not be made by P2P since it has not been enabled
Queues[1].copy(arr1, arr0, N).wait();

if (Dev0.ext_oneapi_can_access_peer(
        Dev1, sycl::ext::oneapi::peer_access::access_supported)) {
  Dev0.ext_oneapi_disable_peer_access(Dev1);
}

// Copy will not be made by P2P because peer access has now been disabled
Queues[0].copy(arr0, arr1, N).wait();

And then another example for peer access (the above is an example of peer copy); a possible sketch of such an example follows below. Also, I have apparently discovered that P2P via NVLink on Nvidia hardware is bi-directional, such that Dev0.ext_oneapi_enable_peer_access(Dev1); also enables Dev1 to do a P2P copy to Dev0. I think that this is not the case via PCIe, but I need to check this.
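For instance, a peer-access counterpart to the copy example above could look like the following sketch (reusing Queues, Dev0, Dev1, arr0, and arr1 from above, and assuming both devices are in the same context so that arr1 is dereferenceable from Dev0 once access is enabled):

if (Dev0.ext_oneapi_can_access_peer(
        Dev1, sycl::ext::oneapi::peer_access::access_supported)) {
  Dev0.ext_oneapi_enable_peer_access(Dev1);

  // A kernel running on Dev0 dereferences arr1, which was allocated on Dev1:
  // this is a "peer access" rather than a "peer copy".
  Queues[0].parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
    arr0[i] += arr1[i];
  }).wait();

  Dev0.ext_oneapi_disable_peer_access(Dev1);
}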

jbrodman and others added 4 commits February 28, 2023 15:38
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Signed-off-by: James Brodman <james.brodman@intel.com>
@gmlueck gmlueck left a comment


Looks good. Just a couple spelling mistakes.

jbrodman and others added 2 commits March 2, 2023 13:30
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
@gmlueck gmlueck left a comment


LGTM

@JackAKirk (Contributor)

Hi @jbrodman @gmlueck

Just to confirm so I can update the implementation:

  • Is the extension OK'd now by syclomatic?

  • I take it that you still want access_enabled? That is fine; I just wanted to double-check that this is still the case now that we know it isn't natively supported by CUDA, unlike what the CUDA docs imply.

Signed-off-by: James Brodman <james.brodman@intel.com>
@steffenlarsen steffenlarsen merged commit b5bce77 into intel:sycl Mar 3, 2023
@JackAKirk JackAKirk self-requested a review March 3, 2023 16:08
namespace oneapi {
enum class peer_access {
access_supported,
access_enabled,
Contributor

access_enabled was removed below, but not here.

Contributor Author

Oops!

dm-vodopyanov pushed a commit that referenced this pull request Jul 10, 2023
This implements the current extension doc from
#6104 in the CUDA backend only.

Fixes #7543.
Fixes #6749.

---------

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Co-authored-by: Nicolas Miller <nicolas.miller@codeplay.com>
Co-authored-by: JackAKirk <chezjakirk@gmail.com>
Co-authored-by: Steffen Larsen <steffen.larsen@intel.com>
veselypeta pushed a commit to veselypeta/llvm that referenced this pull request Sep 21, 2023
fabiomestre pushed a commit to fabiomestre/llvm that referenced this pull request Sep 26, 2023
fabiomestre pushed a commit to fabiomestre/unified-runtime that referenced this pull request Sep 26, 2023
fabiomestre pushed a commit to oneapi-src/unified-runtime that referenced this pull request Sep 27, 2023
omarahmed1111 pushed a commit to omarahmed1111/unified-runtime that referenced this pull request Oct 23, 2023
omarahmed1111 pushed a commit to omarahmed1111/unified-runtime that referenced this pull request Oct 23, 2023