
[mlir][GPU] Add RecursiveMemoryEffects to gpu.launch #75315

Merged

Conversation

matthias-springer
Member

Infer the side effects of gpu.launch from its body.

llvmbot (Collaborator) commented Dec 13, 2023

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-gpu

Author: Matthias Springer (matthias-springer)

Changes

Infer the side effects of gpu.launch from its body.


Full diff: https://github.com/llvm/llvm-project/pull/75315.diff

1 file affected:

  • (modified) mlir/include/mlir/Dialect/GPU/IR/GPUOps.td (+2-1)
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
index 7cad1cd89fd63..d7ab397fb771c 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -672,7 +672,8 @@ def GPU_LaunchFuncOp :GPU_Op<"launch_func", [
 
 def GPU_LaunchOp : GPU_Op<"launch", [
       AutomaticAllocationScope, AttrSizedOperandSegments, GPU_AsyncOpInterface,
-      DeclareOpInterfaceMethods<InferIntRangeInterface>]>,
+      DeclareOpInterfaceMethods<InferIntRangeInterface>,
+      RecursiveMemoryEffects]>,
     Arguments<(ins Variadic<GPU_AsyncToken>:$asyncDependencies,
                Index:$gridSizeX, Index:$gridSizeY, Index:$gridSizeZ,
                Index:$blockSizeX, Index:$blockSizeY, Index:$blockSizeZ,

grypp (Member) commented Dec 13, 2023

Shall we also have a test that shows the motivation of this PR?

Infer the side effects of `gpu.launch` from its body.
matthias-springer (Member, Author)

Added a test case.


// -----

// The GPU kernel does not have any side effecting ops, so the entire
// gpu.launch op can be folded away.
Member

I'm not very familiar with RecursiveMemoryEffects; what exactly does it do?

Is it safe with managed memory or unified memory? In these cases, a pointer address is the same between GPU and CPU. One can imagine there is an llvm.ptr that can be used to build a memref descriptor. It could be used inside the gpu.launch body, or somewhere else; is it safe in these cases?

matthias-springer (Member, Author) commented Dec 15, 2023

RecursiveMemoryEffects indicates that the side effects of this op are defined by the side effects of nested ops.

In essence: gpu.launch by itself does not have any side effects. E.g., it does not read/write to memory etc. But the GPU kernel may have ops that read/write to memory. When we use the C++ API to query the side effects of a gpu.launch, it will automatically check the side effects of the ops inside the kernel and return those.

If the ops in the kernel have no side effects, then the gpu.launch has no side effects. That's why the gpu.launch in the newly added test case folds away. It does not read/write to memory etc. It can just be removed.
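A minimal sketch of such a foldable launch (illustrative names and shapes, assuming the standard custom syntax of gpu.launch; not necessarily the exact test case added in this PR):

```mlir
func.func @fold_away(%sz : index) {
  // The body contains only a pure arith op and the terminator, so the
  // launch as a whole reports no memory effects and can be removed.
  gpu.launch blocks(%bx, %by, %bz) in (%gdx = %sz, %gdy = %sz, %gdz = %sz)
             threads(%tx, %ty, %tz) in (%bdx = %sz, %bdy = %sz, %bdz = %sz) {
    %c0 = arith.constant 0 : index
    gpu.terminator
  }
  return
}
```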

One can imagine there is an llvm.ptr and it can be used to build a memref descriptor. It could be used inside the gpu.launch body, or somewhere else; is it safe in these cases?

When you use that memref value, e.g., with a memref.load, you have an op with a side effect and the entire kernel will be considered to have that side effect.
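For contrast, a hypothetical body containing a load (again with illustrative names) would keep the launch alive:

```mlir
func.func @keep_launch(%m : memref<?xf32>, %sz : index) {
  gpu.launch blocks(%bx, %by, %bz) in (%gdx = %sz, %gdy = %sz, %gdz = %sz)
             threads(%tx, %ty, %tz) in (%bdx = %sz, %bdy = %sz, %bdz = %sz) {
    // memref.load has a read side effect; with RecursiveMemoryEffects,
    // the surrounding gpu.launch inherits it and is not folded away.
    %v = memref.load %m[%tx] : memref<?xf32>
    gpu.terminator
  }
  return
}
```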

grypp (Member) commented Dec 17, 2023

Thanks for your patience and explanation.

gpu.launch doesn't write to memory directly; it performs the following actions. It will still need to read kernel configurations even with an empty body. However, I think it's safe to fold.

  1. Reads host memory for kernel configuration and parameters.
  2. Copies kernel parameters to the device memory.
  3. Waits for stream(s).

Thinking about the 3rd action with the IR below: if %taskC (which has an empty body) is folded, could %taskD run before %taskB because it no longer waits for %stream2? (Currently, MLIR uses a single stream, but one can imagine implementing support for multiple streams.)

%taskA = gpu.launch async [%stream1] ...
%taskB = gpu.launch async [%stream2] ...
%taskC = gpu.launch async [%stream1, %stream2] ...
%taskD = gpu.launch async [%stream1] ...

task execution graph:

a   b
 \ /
  c
  |
  d

Is it overkill to think about this case?

Member Author

The kernel configuration and parameters (e.g., number of blocks, etc.) are passed to gpu.launch as integer operands, so the op does not read anything from memory (memref) at this point, right? (The op may be lowered to something that reads from memory; that op would then have a memory side effect.)

gpu.launch takes async token arguments and produces an async token result. If there is a particular order in which kernels must be launched (and waited for), I would expect this to be modeled with async tokens. IMO, execution order due to "stream is available or not" does not have to be preserved when folding gpu.launch ops.

%taskA = gpu.launch async [] ...
%taskB = gpu.launch async [] ...
%taskC = gpu.launch async [%taskA, %taskB] ...
%taskD = gpu.launch async [%taskC] ...

In the above example, %taskC can be canonicalized away if it has no side effects, but %taskD will then depend on %taskA and %taskB.
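A sketch of the expected IR after such a canonicalization (following the elided `...` style of the example above):

```mlir
// %taskC has been removed; its async token operands are forwarded so
// that %taskD still waits on both predecessors.
%taskA = gpu.launch async [] ...
%taskB = gpu.launch async [] ...
%taskD = gpu.launch async [%taskA, %taskB] ...
```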

Member Author

We discussed offline. The way CUDA streams are expressed in the GPU dialect is via async tokens. It would not make sense for gpu.launch and other ops to take an additional %stream operand in the future. We use async tokens (%task) for that.

Coming back to RecursiveMemoryEffects, this trait should be safe to attach. There are no side effects wrt. streams, because streams are encoded with async tokens.

Empty GPU kernels like the one in the test case will not simply DCE away if they impose a dependency via async tokens. (A folder or canonicalization pattern may find a way to remove empty kernels in a way that preserves the dependencies expressed by the async tokens.)

@matthias-springer matthias-springer merged commit f709642 into llvm:main Dec 20, 2023
3 of 4 checks passed