
[Epic] CUDA HAL driver rewrite for production #13245

Closed
allieculp opened this issue Apr 21, 2023 · 8 comments · Fixed by #16107
Labels: epic (Tracking issues for related deliverables, as in Agile dev), p1

Comments

@allieculp

allieculp commented Apr 21, 2023

Background

The current CUDA HAL driver was prototyped in the early days and has implementation details that make production usage hard. To name a few big ones: full stream synchronization after each queue execute, no HAL semaphore support, no stream-ordered allocation support, linear nodes in CUDA graphs, and a tightly integrated context wrapper making it hard to support multiple queues/devices. There are also various other smaller issues, like tightly coupled CUDA and NCCL symbols hindering optionality.

Goals and Plans

We need to rewrite portions of the CUDA HAL driver and modernize others to better implement the IREE HAL abstractions and support future production use cases. Concretely, we would like the CUDA HAL driver to:

  • Support async workload submission for better GPU utilization
  • Implement proper HAL semaphores for host/device <-> host/device synchronization
  • Implement stream-ordered allocation to sequence allocations with compute
  • Implement HAL command buffers with CUDA graphs for low launch overhead
  • Factor multi-queue/device support into various APIs so it is not limited later

If done properly, the above should make it easier to support multi-queue/device training and inference, and to interleave different CUDA library/CodeGen solutions and even external systems.
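
The stream-ordered allocation goal above can be illustrated with a minimal CUDA driver API sketch (assuming CUDA 11.2+ for `cuMemAllocAsync`/`cuMemFreeAsync`; the kernel handle, launch dimensions, and sizes are placeholders, not IREE code):

```c
// Sketch: allocation, compute, and deallocation all ordered on one stream,
// so no host-side synchronization is needed between them.
#include <cuda.h>

static CUresult run_async(CUstream stream, CUfunction kernel, size_t size) {
  CUdeviceptr buffer = 0;
  CUresult result = cuMemAllocAsync(&buffer, size, stream);  // CUDA 11.2+
  if (result != CUDA_SUCCESS) return result;

  void* params[] = {&buffer};  // assumes the kernel takes one pointer argument
  result = cuLaunchKernel(kernel, /*gridDim=*/256, 1, 1, /*blockDim=*/256, 1, 1,
                          /*sharedMemBytes=*/0, stream, params, NULL);
  if (result != CUDA_SUCCESS) return result;

  // The free is stream-ordered after the kernel; the host never blocks here.
  return cuMemFreeAsync(buffer, stream);
}
```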

Note that this doesn't concern the kernel and executable side much. There are places we can improve there too, but those are largely detachable and can happen separately.

The overall plan is to start a new experimental HAL driver, cuda2, under experimental/, pull in pieces from the existing CUDA HAL driver when suitable, and improve and land incrementally.

Concrete Tasks

These are major high-level tasks for now; as we go we'll break them down into smaller ones and create separate task lists to track.

### Basics
- [ ] https://github.com/openxla/iree/pull/13942
- [ ] https://github.com/openxla/iree/pull/13955
- [ ] https://github.com/openxla/iree/pull/14011
- [ ] https://github.com/openxla/iree/pull/14063
- [ ] https://github.com/openxla/iree/issues/13250 
- [ ] Document this CUDA HAL driver and its relationship with other ecosystem components
### Memory Management
- [ ] https://github.com/openxla/iree/pull/13985
- [ ] https://github.com/openxla/iree/pull/13440
### Compute Pipeline
- [ ] https://github.com/openxla/iree/pull/14038
- [ ] https://github.com/openxla/iree/pull/14062
- [ ] https://github.com/openxla/iree/pull/14326
- [ ] https://github.com/openxla/iree/pull/14524
- [ ] https://github.com/openxla/iree/pull/14525
- [ ] https://github.com/openxla/iree/pull/14526
- [ ] https://github.com/openxla/iree/pull/15325
- [ ] https://github.com/openxla/iree/pull/14857
- [ ] https://github.com/openxla/iree/pull/15423
### Synchronization
- [ ] https://github.com/openxla/iree/pull/14325
- [ ] https://github.com/openxla/iree/issues/4727
### Collectives
- [ ] https://github.com/openxla/iree/pull/14059
@aaron-schneider aaron-schneider added the epic Tracking issues for related deliverables (as in Agile dev) label Apr 26, 2023
@allieculp allieculp changed the title [Epic] CUDA HAL rewrite (asynchronous w/ support for CUDA graphs) [Epic] CUDA HAL Async Implementation (asynchronous w/ support for CUDA graphs) May 2, 2023
@allieculp
Author

@julianwa @antiagainst @mattwalsh @benvanik I believe this is still in Lei's queue; please update if any progress has been made.

@allieculp allieculp added the p1 label May 4, 2023
@allieculp
Author

@julianwa @mattwalsh @antiagainst Noting here, this may need to be bumped up in priority before end of Q2.

@julianwa
Member

julianwa commented May 8, 2023

@antiagainst when do you expect the Metal back-end to land? I assume it would be a poor choice to start this prior to that. I'm just trying to gauge the earliest timeframe to start this, without considering the other work you have planned for Q2 for the moment.

@antiagainst
Contributor

I'm under the impression that the CUDA HAL rewrite is scheduled to start in Q3. I'm expecting to finish up and land the Metal HAL impl by the end of this month. So the earliest we can start is June.

@antiagainst antiagainst changed the title [Epic] CUDA HAL Async Implementation (asynchronous w/ support for CUDA graphs) [Epic] CUDA HAL driver rewrite for production Jun 3, 2023
@antiagainst
Contributor

I've updated the title and fleshed out some of the goals and plans in the issue, after discussing with @benvanik. This is not final and we'll continue to refine the details as we go.

antiagainst added a commit that referenced this issue Jun 5, 2023
This commit starts a CUDA HAL driver rewrite under `experimental/`.
We create a new `cuda2/` directory to host the new code to avoid
interrupting the current CodeGen development.

This commit just brings in the basics for booting up a new HAL driver,
including dynamic symbols management, error status management, and
the IREE HAL driver implementation. Most of the code is directly copied
from the existing HAL driver, with the following notable changes:

* Split CUDA and NCCL dynamic symbols into separate structures for
  better organization and to allow optionality.
* Fleshed out CUDA error to IREE status conversions.
* Better organized code blocks and improved error messages and
  various comments.

Building this commit with `-DIREE_EXTERNAL_HAL_DRIVERS=cuda2`,
we can have `tools/iree-run-module --dump_devices` show `cuda2`
devices, in parallel with the existing CUDA one.

Progress towards #13245
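
As a rough sketch of the CUDA-error-to-IREE-status conversion mentioned above (the helper name and exact formatting are illustrative assumptions, not the landed code):

```c
// Minimal sketch: convert a CUresult into an iree_status_t with a readable
// message. The helper name and message layout are illustrative only.
#include <cuda.h>
#include "iree/base/api.h"

static iree_status_t cu_result_to_status(CUresult result, const char* expr) {
  if (result == CUDA_SUCCESS) return iree_ok_status();
  const char* name = NULL;
  const char* desc = NULL;
  // cuGetErrorName/cuGetErrorString can themselves fail for unknown codes.
  if (cuGetErrorName(result, &name) != CUDA_SUCCESS) name = "CUDA_ERROR_UNKNOWN";
  if (cuGetErrorString(result, &desc) != CUDA_SUCCESS) desc = "unknown error";
  return iree_make_status(IREE_STATUS_INTERNAL, "%s failed: %s (%s)",
                          expr, name, desc);
}

// Usage sketch:
//   IREE_RETURN_IF_ERROR(cu_result_to_status(cuInit(0), "cuInit"));
```
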
antiagainst added a commit that referenced this issue Jun 8, 2023
This commit ports over the existing CUDA driver allocator and buffer
implementation. The main logic is kept as-is, with one notable
change: the context wrapper is dropped and its fields are passed
directly in various API calls. This makes supporting multiple
devices and streams easier later. Other changes are just polish
on comments and error messages.

Progress towards #13245
antiagainst added a commit that referenced this issue Jun 9, 2023
This commit ports over existing CUDA HAL driver code to enable
creating devices, allocators, and buffers. It's mostly NFC,
with just symbol renaming and stubbing out various functionalities
as unimplemented.

With this commit, we are able to pick up CTS tests.

Progress towards #13245
antiagainst added a commit that referenced this issue Jun 12, 2023
This commit reuses the existing CUDA HAL driver's descriptor set
and pipeline layout implementation. Along the way, it drops the
context wrapper and adds more documentation.

Progress towards #13245
antiagainst added a commit that referenced this issue Jun 12, 2023
nithinsubbiah added a commit to nithinsubbiah/iree that referenced this issue Nov 9, 2023
This commit starts a HIP backend in HAL. Following the steps of the CUDA rewrite (iree-org#13245) effort, this backend will provide improvements over the existing ROCm HAL backend. Building this commit with -DIREE_EXTERNAL_HAL_DRIVERS=hip will initialize this driver, and the driver information can be seen using `tools/iree-run-module --dump_devices`.
nithinsubbiah added a commit to nithinsubbiah/iree that referenced this issue Nov 13, 2023
This commit starts a HIP backend in HAL. Following the steps of the CUDA rewrite (iree-org#13245) effort, this backend will provide improvements over the existing ROCm HAL backend. Building this commit with -DIREE_EXTERNAL_HAL_DRIVERS=hip will initialize this driver, and the driver information can be seen using `tools/iree-run-module --dump_devices`.
antiagainst added a commit that referenced this issue Nov 14, 2023
This commit changes the stream command buffer to keep track of a fixed
list of descriptor sets. This simplifies the logic of pushing
descriptors/constants; we only perform kernel parameter serialization at
dispatch time. This means we don't pay the overhead if multiple push
descriptor/constant commands are issued before dispatching. We also
no longer need to sort descriptors.

Progress towards #13245
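
A minimal sketch of the deferred-serialization idea described above, simplified to a single descriptor set (limits, names, and launch dimensions are illustrative assumptions, not the actual command buffer code):

```c
// Sketch: push_descriptor_set/push_constants only record into fixed arrays;
// the flat kernel parameter list for cuLaunchKernel is built once at
// dispatch time, so repeated pushes before a dispatch cost almost nothing.
#include <stdint.h>
#include <cuda.h>

#define MAX_BINDINGS 16   // illustrative limits
#define MAX_CONSTANTS 64

typedef struct {
  CUdeviceptr bindings[MAX_BINDINGS];  // latest pushed buffer bindings
  uint32_t constants[MAX_CONSTANTS];   // latest pushed constants
} cmdbuf_state_t;

// Pushing only overwrites slots; no serialization or sorting happens here.
static void push_binding(cmdbuf_state_t* s, uint32_t slot, CUdeviceptr buffer) {
  s->bindings[slot] = buffer;
}

// At dispatch time the current state is serialized into the parameter array
// that cuLaunchKernel expects (one pointer per kernel argument).
static CUresult dispatch(cmdbuf_state_t* s, CUfunction f, CUstream stream,
                         uint32_t binding_count, uint32_t constant_count) {
  void* params[MAX_BINDINGS + MAX_CONSTANTS];
  for (uint32_t i = 0; i < binding_count; ++i) params[i] = &s->bindings[i];
  for (uint32_t i = 0; i < constant_count; ++i)
    params[binding_count + i] = &s->constants[i];
  return cuLaunchKernel(f, /*gridDim=*/1, 1, 1, /*blockDim=*/1, 1, 1,
                        /*sharedMemBytes=*/0, stream, params, NULL);
}
```
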
antiagainst added a commit that referenced this issue Nov 14, 2023
Legacy sync mode forces waiting on semaphores immediately, so
it effectively runs all async allocation/execution in sync mode.
Now that we have proper semaphore emulation in cuda2, we don't
need this anymore. Disable it for end-to-end op tests for the
experimental cuda2 driver.

Progress towards #13245
Groverkss pushed a commit that referenced this issue Nov 30, 2023
This commit starts a HIP backend in HAL. Following the steps of the new
CUDA backend (`cuda2`) (#13245)
effort, HIP will provide enhancements over the existing ROCm HAL backend,
which is based on the current CUDA backend. This commit can be built
with `-DIREE_EXTERNAL_HAL_DRIVERS=hip`, which will initialize the HIP
driver. Device registration can be verified using `tools/iree-run-module
--dump_devices`.
ramiro050 pushed a commit to ramiro050/iree that referenced this issue Dec 19, 2023
This commit changes the graph command buffer to keep track of a fixed
list of descriptor sets. This simplifies the logic of pushing
descriptors/constants; we only perform kernel parameter serialization at
dispatch time. This means we don't pay the overhead if multiple push
descriptor/constant commands are issued before dispatching. We also
no longer need to sort descriptors.


Progress towards iree-org#13245
ramiro050 pushed a commit to ramiro050/iree that referenced this issue Dec 19, 2023
)

This commit improves the graph node dependency management by avoiding
serializing all nodes in the graph command buffer into a linear chain.
Command buffer barriers become empty CUDA graph nodes that depend on
the nodes created after the previous barrier.

Progress towards iree-org#13245

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
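
A minimal sketch of the barrier-as-empty-node idea, assuming a hypothetical builder type and limits (not the actual command buffer code):

```c
// Sketch: implement a command buffer barrier as an empty CUDA graph node
// that depends on all nodes recorded since the previous barrier, instead of
// chaining every node into one linear sequence.
#include <stddef.h>
#include <cuda.h>

#define MAX_NODES 256  // illustrative limit

typedef struct {
  CUgraph graph;
  CUgraphNode since_last_barrier[MAX_NODES];  // nodes recorded after the last barrier
  size_t count;
  CUgraphNode last_barrier;  // dependency for newly recorded nodes (may be NULL)
} graph_builder_t;

static CUresult record_barrier(graph_builder_t* b) {
  CUgraphNode barrier = NULL;
  // The empty node depends on everything recorded since the previous barrier.
  CUresult result = cuGraphAddEmptyNode(&barrier, b->graph,
                                        b->since_last_barrier, b->count);
  if (result != CUDA_SUCCESS) return result;
  b->last_barrier = barrier;
  b->count = 0;  // subsequent nodes depend only on the new barrier node
  return CUDA_SUCCESS;
}
```
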
ramiro050 pushed a commit to ramiro050/iree that referenced this issue Dec 19, 2023
Events aren't used by the compiler right now.

Progress towards iree-org#13245
ramiro050 pushed a commit to ramiro050/iree that referenced this issue Dec 19, 2023
…5437)

This commit changes the stream command buffer to keep track of a fixed
list of descriptor sets. This simplifies the logic of pushing
descriptors/constants; we only perform kernel parameter serialization at
dispatch time. This means we don't pay the overhead if multiple push
descriptor/constant commands are issued before dispatching. We also
no longer need to sort descriptors.

Progress towards iree-org#13245
ramiro050 pushed a commit to ramiro050/iree that referenced this issue Dec 19, 2023
Legacy sync mode forces waiting on semaphores immediately, so
it effectively runs all async allocation/execution in sync mode.
Now that we have proper semaphore emulation in cuda2, we don't
need this anymore. Disable it for end-to-end op tests for the
experimental cuda2 driver.

Progress towards iree-org#13245
ramiro050 pushed a commit to ramiro050/iree that referenced this issue Dec 19, 2023
This commit starts a HIP backend in HAL. Following the steps of the new
CUDA backend (`cuda2`) (iree-org#13245)
effort, HIP will provide enhancements over the existing ROCm HAL backend,
which is based on the current CUDA backend. This commit can be built
with `-DIREE_EXTERNAL_HAL_DRIVERS=hip`, which will initialize the HIP
driver. Device registration can be verified using `tools/iree-run-module
--dump_devices`.
antiagainst added a commit that referenced this issue Jan 9, 2024
…15673)

When the GPU signals the CUevent-implemented semaphore to a new value, we
need to perform a `cuLaunchHostFunc()` call to have the CPU signal
that value. At the same time, we can try to advance the deferred queue
to release more workload to the GPU. Advancing the deferred queue cannot
happen as part of the `cuLaunchHostFunc()` call, given that it runs on a
driver thread. Per the documentation:

  The host function must not make any CUDA API calls. Attempting
  to use a CUDA API may result in CUDA_ERROR_NOT_PERMITTED, but
  this is not required. The host function must not perform any
  synchronization that may depend on outstanding CUDA work not
  mandated to run earlier. Host functions without a mandated order
  (such as in independent streams) execute in undefined order and
  may be serialized.

We were actually doing that and seeing deadlocks. So this commit changes
the implementation to create a new thread dedicated to monitoring and
releasing workload to the GPU, avoiding CUDA API calls as part of the
`cuLaunchHostFunc()` call.

Progress towards #13245
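
A minimal sketch of the dedicated-thread approach, with hypothetical worker types (the real implementation's pending-action queue is more involved):

```c
// Sketch: keep the cuLaunchHostFunc() callback CUDA-free by having it only
// wake a dedicated worker thread; the worker then advances the deferred
// queue and is free to make CUDA API calls.
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <cuda.h>

typedef struct {
  pthread_mutex_t mu;
  pthread_cond_t cv;
  uint64_t pending_signals;  // number of semaphore signals to process
  bool exit_requested;
} worker_t;

// Runs on a CUDA driver thread: must not call any CUDA API.
static void CUDA_CB host_signal_callback(void* user_data) {
  worker_t* w = (worker_t*)user_data;
  pthread_mutex_lock(&w->mu);
  ++w->pending_signals;
  pthread_cond_signal(&w->cv);
  pthread_mutex_unlock(&w->mu);
}

// Runs on our own thread: free to advance the deferred queue and call CUDA.
static void* worker_main(void* arg) {
  worker_t* w = (worker_t*)arg;
  pthread_mutex_lock(&w->mu);
  while (!w->exit_requested) {
    while (w->pending_signals == 0 && !w->exit_requested)
      pthread_cond_wait(&w->cv, &w->mu);
    uint64_t n = w->pending_signals;
    w->pending_signals = 0;
    pthread_mutex_unlock(&w->mu);
    // ... signal host-side semaphore state and release more GPU workload,
    // e.g. submit deferred command buffers via cuGraphLaunch()/cuLaunchKernel().
    (void)n;
    pthread_mutex_lock(&w->mu);
  }
  pthread_mutex_unlock(&w->mu);
  return NULL;
}

// Enqueue the host callback after GPU work on `stream`:
//   cuLaunchHostFunc(stream, host_signal_callback, &worker);
```
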
antiagainst added a commit that referenced this issue Jan 9, 2024
The device already retains the device event pool, so the device event
pool should not retain the device again. Otherwise, each holds at
least one reference on the other and neither will ever be freed naturally.

Progress towards #13245
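
A minimal sketch of the ownership fix, with hypothetical types (the real structures carry much more state): the device retains the pool, while the pool keeps only an unretained back-pointer.

```c
// Sketch: break the device <-> event pool retain cycle by giving the pool an
// unretained (weak) back-pointer to the device.
typedef struct device_t device_t;

typedef struct {
  int ref_count;     // the pool's own reference count
  device_t* device;  // unretained: safe because the device owns the pool and
                     // destroys it before the device itself is freed
} event_pool_t;

struct device_t {
  int ref_count;
  event_pool_t* event_pool;  // retained: released when the device is destroyed
};
```
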
antiagainst added a commit that referenced this issue Jan 10, 2024
This commit moves the CUDA HAL driver rewrite to the
`hal/drivers` directory given that it's functional and ready for
normal usage. With this we can start running tests in CI
to make sure it does not regress. Further improvements
can happen directly in this directory.

This provides an easy route for trying out the rewrite before
eventually replacing the existing HAL driver.

Along the way, this wires up the BUILD configurations.

Progress towards #13245
@antiagainst
Contributor

#14620 landed; now cuda2 is in hal/drivers. Next, we need to:

  • Replace various in-tree tests/benchmarks to use cuda2 and address potential issues
  • [After being stable for some time] Delete cuda and rename cuda2 to cuda

Also sent https://groups.google.com/g/iree-discuss/c/q2_HgHmGma0 to the group.

@antiagainst
Contributor

#16107 switches the new cuda2 impl on by default. Follow-up work is to delete the cuda1 code and rename cuda2 symbols to cuda, but that's not blocking, so this issue is marked as closed now.

antiagainst added a commit that referenced this issue Jan 27, 2024
This is the last step of refreshing the cuda HAL implementation.
Mostly done via the following commands:

```sh
find runtime/ -type f -exec sed -i 's/CUDA2/CUDA/g' {} \;
find runtime/ -type f -exec sed -i 's/cuda2/cuda/g' {} \;
```

Related to #13245
Groverkss pushed a commit that referenced this issue Feb 2, 2024
This is the last step of refreshing the cuda HAL implementation.
Mostly done via the following commands:

```sh
find runtime/ -type f -exec sed -i 's/CUDA2/CUDA/g' {} \;
find runtime/ -type f -exec sed -i 's/cuda2/cuda/g' {} \;
```

Related to #13245