Commit bbf17fd

[NVVM] Make nanosleep op duration SSA value
1 parent 2681497 commit bbf17fd

1 file changed: +38 -0 lines changed
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td

Lines changed: 38 additions & 0 deletions
@@ -79,6 +79,44 @@ def NVVM_Dialect : Dialect {
 sequence must be expressed directly, NVVM provides an `nvvm.inline_ptx` op to
 embed PTX inline as a last-resort escape hatch, with explicit operands and
 results.
+
+
+**Memory Spaces:** The NVVM dialect introduces the following memory spaces,
+each with distinct scopes and lifetimes:
+
+| Memory Space      | Scope                | Lifetime          |
+|-------------------|----------------------|-------------------|
+| `generic`         | All threads          | Context-dependent |
+| `global`          | All threads (device) | Application       |
+| `shared`          | Thread block (CTA)   | Kernel execution  |
+| `constant`        | All threads (RO)     | Application       |
+| `local`           | Single thread        | Kernel execution  |
+| `tensor`          | Thread block (CTA)   | Kernel execution  |
+| `shared_cluster`  | Thread block cluster | Kernel execution  |
+
+**Memory Space Details:**
+- **generic**: Can point to any memory space; requires runtime resolution of
+actual address space. Use when pointer origin is unknown at compile time.
+Performance varies based on the underlying memory space.
+- **global**: Accessible by all threads across all blocks; persists across
+kernel launches. Highest latency but largest capacity (device memory). Best
+for large data and inter-kernel communication.
+- **shared**: Shared within a thread block (CTA); very fast on-chip memory for
+cooperation between threads in the same block. Limited capacity. Ideal for
+block-level collaboration, caching, and reducing global memory traffic.
+- **constant**: Read-only memory cached per SM; optimized for broadcast
+patterns where all threads access the same location. Fast access when cached.
+Size typically limited to 64 KB. Best for read-only data and uniform values
+accessed by all threads.
+- **local**: Private to each thread; used for stack frames and register spills.
+Actually resides in global memory but cached in L1. Use for per-thread
+private data and automatic variables that don't fit in registers.
+- **tensor**: Special memory space for Tensor Memory, the on-chip storage used
+by 5th-generation tensor core (`tcgen05`) instructions on SM 100+
+architectures. Provides very fast access for matrix operations.
+- **shared_cluster**: Shared across thread blocks within a cluster (SM 90+);
+enables collaboration beyond single-block scope with distributed shared
+memory. Fast access across cluster threads.
 }];

 let name = "nvvm";
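
For readers mapping the memory-space table onto pointer types, here is a minimal C++ sketch. It is not part of this commit. The names `MemorySpace`, `MemorySpaceInfo`, `lookupMemorySpace`, and `pointerTypeFor` are hypothetical, and the numeric values are an assumption based on the NVPTX address-space numbering; the scope and lifetime strings are copied from the table in the hunk.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical enum for illustration only. The numeric values assume the
// NVPTX address-space numbering; they are not stated in this commit.
enum class MemorySpace : unsigned {
  Generic = 0,
  Global = 1,
  Shared = 3,
  Constant = 4,
  Local = 5,
  Tensor = 6,
  SharedCluster = 7,
};

struct MemorySpaceInfo {
  MemorySpace space;
  std::string_view name;
  std::string_view scope;
  std::string_view lifetime;
};

// Rows mirror the Memory Space / Scope / Lifetime table in the description.
constexpr MemorySpaceInfo kMemorySpaces[] = {
    {MemorySpace::Generic, "generic", "All threads", "Context-dependent"},
    {MemorySpace::Global, "global", "All threads (device)", "Application"},
    {MemorySpace::Shared, "shared", "Thread block (CTA)", "Kernel execution"},
    {MemorySpace::Constant, "constant", "All threads (RO)", "Application"},
    {MemorySpace::Local, "local", "Single thread", "Kernel execution"},
    {MemorySpace::Tensor, "tensor", "Thread block (CTA)", "Kernel execution"},
    {MemorySpace::SharedCluster, "shared_cluster", "Thread block cluster",
     "Kernel execution"},
};

// Returns the table row for `name`, or nullptr if the name is unknown.
const MemorySpaceInfo *lookupMemorySpace(std::string_view name) {
  for (const MemorySpaceInfo &info : kMemorySpaces)
    if (info.name == name)
      return &info;
  return nullptr;
}

// Spells the LLVM dialect pointer type for a memory space. Address space 0
// (generic) is printed without an explicit address-space parameter.
std::string pointerTypeFor(std::string_view name) {
  const MemorySpaceInfo *info = lookupMemorySpace(name);
  assert(info && "unknown memory space name");
  unsigned addressSpace = static_cast<unsigned>(info->space);
  if (addressSpace == 0)
    return std::string("!llvm.ptr");
  return "!llvm.ptr<" + std::to_string(addressSpace) + ">";
}
```

Under these assumptions, `pointerTypeFor("shared")` yields `!llvm.ptr<3>`, while `pointerTypeFor("generic")` yields a plain `!llvm.ptr` whose actual space is resolved at runtime, as the description notes.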
