diff --git a/mlir/docs/Dialects/NVVMDialect.md b/mlir/docs/Dialects/NVVMDialect.md index 12ec2b3fd989e..b2f5e888b9772 100644 --- a/mlir/docs/Dialects/NVVMDialect.md +++ b/mlir/docs/Dialects/NVVMDialect.md @@ -58,23 +58,45 @@ scopes and lifetimes: - **generic**: Can point to any memory space; requires runtime resolution of actual address space. Use when pointer origin is unknown at compile time. - Performance varies based on the underlying memory space. + Performance varies based on the underlying memory space. A pointer to this + memory space is represented by `LLVM_PointerGeneric` in the NVVM Ops. - **global**: Accessible by all threads across all blocks; persists across kernel launches. Highest latency but largest capacity (device memory). Best - for large data and inter-kernel communication. + for large data and inter-kernel communication. A pointer to this memory space + is represented by `LLVM_PointerGlobal` in the NVVM Ops. - **shared**: Shared within a thread block (CTA); very fast on-chip memory for cooperation between threads in the same block. Limited capacity. Ideal for block-level collaboration, caching, and reducing global memory traffic. + This memory is usually referred as `shared_cta` in the NVVMOps and as + `shared::cta` in the PTX ISA. A pointer to this memory space is represented + by the `LLVM_PointerShared` type in the NVVM Ops. - **constant**: Read-only memory cached per SM. Size typically limited to 64KB. - Best for read-only data and uniform values accessed by all threads. + Best for read-only data and uniform values accessed by all threads. A pointer + to this memory space is represented by `LLVM_PointerConst` type in NVVM Ops. - **local**: Private to each thread. Use for per-thread private data and - automatic variables that don't fit in registers. + automatic variables that don't fit in registers. A pointer to this memory is + represented by `LLVM_PointerLocal` type in NVVM Ops. - **tensor**: Special memory space for tensor core operations. Used by `tcgen05` instructions on SM 100+ for tensor input/output operations. + A pointer to this memory space is represented by the `LLVM_PointerTensor` + type in the NVVM Ops. - **shared_cluster**: Distributed shared memory across thread blocks within a cluster (SM 90+). Enables collaboration beyond single-block scope with fast - access across cluster threads. - + access across cluster threads. This memory is usually referred as + `shared_cluster` in the NVVMOps and as `shared::cluster` in the PTX ISA. + A pointer to this memory space is represented by the `LLVM_PointerSharedCluster` + type in the NVVM Ops. + +## MBarrier objects + +An ``mbarrier`` is a barrier created in shared memory that supports +synchronizing any subset of threads within a CTA. An *mbarrier object* +is an opaque object in shared memory with `.b64` type and an alignment of +8-bytes. Unlike ``nvvm.barrier`` Op which can access only a limited number +of barriers per CTA, the *mbarrier objects* are user-defined and are only +limited by the total shared memory size available. The list of operations +supported on an *mbarrier object* is exposed through the ``nvvm.mbarrier.*`` +family of NVVM Ops. ## Non-Goals