[MLIR][NVVM][Docs] Explain memory spaces #168059
Conversation
@llvm/pr-subscribers-mlir @llvm/pr-subscribers-mlir-llvm

Author: Guray Ozen (grypp)

Full diff: https://github.com/llvm/llvm-project/pull/168059.diff (1 file affected)
diff --git a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
index 1cc5b74a3cb67..5992abc8efcfd 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
@@ -79,6 +79,45 @@ def NVVM_Dialect : Dialect {
sequence must be expressed directly, NVVM provides an `nvvm.inline_ptx` op to
embed PTX inline as a last-resort escape hatch, with explicit operands and
results.
+
+
+ **Memory Spaces:** The NVVM dialect introduces the following memory spaces,
+ each with distinct scopes and lifetimes:
+
+ | Memory Space | Scope | Lifetime |
+ |-------------------|----------------------|-------------------|
+ | `generic` | All threads | Context-dependent |
+ | `global` | All threads (device) | Application |
+ | `shared` | Thread block (CTA) | Kernel execution |
+ | `constant` | All threads (RO) | Application |
+ | `local` | Single thread | Kernel execution |
+ | `tensor` | Thread block (CTA) | Kernel execution |
+ | `shared_cluster` | Thread block cluster | Kernel execution |
+
+ **Memory Space Details:**
+ - **generic**: Can point to any memory space; requires runtime resolution of
+ actual address space. Use when pointer origin is unknown at compile time.
+ Performance varies based on the underlying memory space.
+ - **global**: Accessible by all threads across all blocks; persists across
+ kernel launches. Highest latency (~400-800 cycles) but largest capacity
+ (device memory). Best for large data and inter-kernel communication.
+ - **shared**: Shared within a thread block (CTA); very fast on-chip memory
+ (~20-40 cycles) for cooperation between threads in the same block. Limited
+ capacity (48-164KB depending on architecture). Ideal for block-level
+ collaboration, caching, and reducing global memory traffic.
+ - **constant**: Read-only memory cached per SM; optimized for broadcast
+ patterns where all threads access the same location. Fast access when cached
+ (~20 cycles). Size typically limited to 64KB. Best for read-only data and
+ uniform values accessed by all threads.
+ - **local**: Private to each thread; used for stack frames and register spills.
+ Actually resides in global memory but cached in L1 (~100-200 cycles). Use for
+ per-thread private data and automatic variables that don't fit in registers.
+  - **tensor**: Special memory space for Tensor Memory Accelerator (TMA)
+    operations on SM 90+ architectures; used with async tensor operations and
+    wgmma instructions. Provides very fast access for matrix operations.
+ - **shared_cluster**: Shared across thread blocks within a cluster (SM 90+);
+ enables collaboration beyond single-block scope with distributed shared
+ memory. Fast access (~40-80 cycles) across cluster threads.
}];
let name = "nvvm";
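
As a hedged sketch (not part of the patch), the memory spaces documented above surface in MLIR as LLVM pointer address spaces, following the NVVM numbering convention (generic = 0, global = 1, shared = 3, constant = 4, local = 5). The function name and element type below are hypothetical illustrations:

```mlir
// Each thread copies one f32 element from global memory (addrspace 1)
// into shared memory (addrspace 3), then decays the shared pointer to
// the generic space (addrspace 0) for a consumer that does not know
// the pointer's origin at compile time.
llvm.func @copy_to_shared(%src : !llvm.ptr<1>, %dst : !llvm.ptr<3>) {
  %tid  = nvvm.read.ptx.sreg.tid.x : i32
  %from = llvm.getelementptr %src[%tid] : (!llvm.ptr<1>, i32) -> !llvm.ptr<1>, f32
  %to   = llvm.getelementptr %dst[%tid] : (!llvm.ptr<3>, i32) -> !llvm.ptr<3>, f32
  %val  = llvm.load %from : !llvm.ptr<1> -> f32
  llvm.store %val, %to : f32, !llvm.ptr<3>
  // Address-space cast from shared to generic; resolution of the actual
  // space then happens at runtime, as the `generic` entry above describes.
  %gen = llvm.addrspacecast %dst : !llvm.ptr<3> to !llvm.ptr
  // Block-level barrier so all threads see the populated shared tile.
  nvvm.barrier0
  llvm.return
}
```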
durga4github
left a comment
The latest revision LGTM
schwarzschild-radius
left a comment
LGTM, Thanks!
No description provided.