
Some information relating to NV50 CUDA and shader code, for lack of a better page name.

See http://www.nvidia.com/object/cuda_develop.html for general information about CUDA.

NOTE: tesla == NV50TCL, NV84TCL, NVA0TCL, or NVA8TCL: the object used for 3d rendering. turing == NV50_COMPUTE, the object used for launching standalone computations. tesla can launch blocks of code as one of the 3 types of shaders in the 3d rendering pipeline; turing launches them as standalone programs.

There were four revisions of CUDA, all backwards-compatible:

| Version | Cards | New ISA stuff | New non-ISA stuff |
| --- | --- | --- | --- |
| 1.0 | NV50 | Original revision | |
| 1.1 | NV8x, NV9x | Atomic 32-bit instructions on g[], breakpoints | Debugging support |
| 1.2 | NVA5, NVA8, some other NVAx? | Atomic 64-bit instructions on g[], atomic 32-bit instructions on s[], vote instructions | 32 warps and 16384 registers per MP |
| 1.3 | NVA0, some other NVAx? | Double-precision floating-point instructions | |

They’re also known as sm_10, sm_11, sm_12, sm_13. Note that 1.2 was actually after 1.3 and is just 1.3 with double-precision thrown out.

Types of programs you can run:

  • VP: Vertex Program. Has inputs in a[], outputs in $o. Launched as the first programmable stage of the tesla rendering pipeline.
  • GP: Geometry Program. Has inputs in a[] and p[], outputs in $o. Optionally launched as the second programmable stage of the tesla rendering pipeline.
  • FP: Fragment Program. Has inputs in v[], outputs in $r. Launched as the third programmable stage of the tesla rendering pipeline, when rasterization is enabled.
  • CP: Compute Program. Has inputs in s[], and is able to access s[] and g[]. Launched as a standalone program by turing.

General setup

| tesla (VP) | tesla (GP) | tesla (FP) | turing (CP) | name | description |
| --- | --- | --- | --- | --- | --- |
| 0x198 | 0x198 | 0x198 | 0x1c0 | DMA_CODE_CB | DMA segment for CBs and program code |
| 0xf7c | 0xf70 | 0xfa4 | 0x210 | [VGFC]P_ADDRESS_HIGH | Base address of code segment |
| 0xf80 | 0xf74 | 0xfa8 | 0x214 | [VGFC]P_ADDRESS_LOW | |
| 0x140c | 0x1410 | 0x1414 | 0x3b4 | [VGFC]P_START_ID | Entry point: initial value of PC |
| 0x16b0 | 0x17a0 | 0x198c | 0x2c0 | [VGFC]P_REG_ALLOC_TEMP | Number of allocated $r registers |
| 0x16b8 | 0x17a8 | - | - | [VG]P_REG_ALLOC_RESULT | Number of allocated $o registers |
| 0x1440 | 0x1440 | 0x1440 | 0x380 | CODE_CB_FLUSH | Writing 0 flushes on-GPU caches of code and CB data. Needed after changing code segment contents. |
| 0x19a0 | 0x19a0 | 0x19a0 | 0x3b8 | WARP_HALVES | Selects maximum number of threads per warp: 1 means 16 threads/warp, 2 means 32 threads/warp. |

All code addresses, including entry points, bra/call/joinat target fields, and the values pushed on stack, need to be aligned to
4-byte units and are relative to code segment base.
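
For instance, pointing turing at a compute program could look something like the sketch below; push_method() is just a hypothetical stand-in for however you submit methods to the object, and the method numbers are the turing (CP) column of the table above:

#include <stdint.h>

void push_method(uint32_t mthd, uint32_t value); /* hypothetical helper */

/* Point turing at a CP that starts `entry` bytes into the code segment and
   uses `nregs` $r registers. */
void setup_cp_code(uint64_t code_base, uint32_t entry, int nregs) {
  push_method(0x210, (uint32_t)(code_base >> 32)); /* CP_ADDRESS_HIGH */
  push_method(0x214, (uint32_t)code_base);         /* CP_ADDRESS_LOW */
  push_method(0x3b4, entry);                       /* CP_START_ID: relative to code segment base */
  push_method(0x2c0, nregs);                       /* CP_REG_ALLOC_TEMP */
  push_method(0x380, 0);                           /* CODE_CB_FLUSH after changing code */
}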

TODO: need to figure out if CB/code cache is per-MP or per-TP or what.

ISA

For now, look at the nv50dis sources for all your NV50 ISA needs; it'll get proper documentation one day.

Thread hierarchy

| What | Contained in | How many per parent | Notes |
| --- | --- | --- | --- |
| TP | Device | 1-10. Read PMC reg 0x1540 & 0xffff and count bits. | Tile/texture processor. A block containing several MPs, 8 texturing units, and some code/const cache. The name doesn't make much sense. |
| MP | TP | 1-3. Read PMC reg 0x1540 & 0x0f000000 and count bits. | Multiprocessor. Contains register and shared memory pools divided between warps/blocks. 8192 registers for <NVA0, 16384 for >=NVA0; 16kB of shared memory. Also known as SM [Scalable Multiprocessor]. |
| Block | MP | Variable | CP-only: a single block of warps that share s[] memory and barriers. A block needs to fit into a single MP [because of s[]], but can be spread out across several warps, and you can have several blocks on a single MP at the same time if they all fit. |
| SP | MP | 8 | Scalar Processor: a single execution unit of an MP. Takes an active warp each cycle and executes the next insn from it. The mapping to handled warps is unknown. Actually, the NVIDIA specs and CUDA docs are the only sources that say they exist at all. |
| Warp | MP | 24 for <NVA0, 32 for >=NVA0 | Warp: a single block of threads sharing a program counter and execution units. If threads within a warp diverge, only one branch can remain active; the other lies dormant until the active branch exits or decides to rejoin. Also the level of granularity for the vote insn. |
| Quad | Warp | 8 | A group of 4 threads. In FP, they render a 2x2 square. Texture instructions assume this geometry for computing implicit derivatives, so you have to hack a bit to use them outside FP. Quads are treated specially by some instructions. |
| Lane | Warp / Quad | 32 / 4 | A single thread. |

The quad of TP id, MP id, warp id, lane id unambiguously identifies a single thread context contained in a GPU. This quad is packed into a special read-only register called physid that contains the following (a small decoding sketch follows the list):

  • physid & 0x0000001f: lane id
  • physid & 0x00001f00: warp id
  • physid & 0x00030000: MP id
  • physid & 0x00f00000: TP id
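
A minimal C sketch of pulling these fields out of a physid value, with the shift amounts following directly from the masks above:

#include <stdint.h>

/* Split a physid value into its components, per the bitfields above. */
struct physid_fields {
  int lane; /* bits 0-4 */
  int warp; /* bits 8-12 */
  int mp;   /* bits 16-17 */
  int tp;   /* bits 20-23 */
};

struct physid_fields decode_physid(uint32_t physid) {
  struct physid_fields f;
  f.lane = physid & 0x1f;
  f.warp = (physid >> 8) & 0x1f;
  f.mp = (physid >> 16) & 0x3;
  f.tp = (physid >> 20) & 0xf;
  return f;
}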

Environment

Available registers

| Name | Indices | Size | Type | Available in | Description |
| --- | --- | --- | --- | --- | --- |
| $r? | $r0-$r<rnum-1> | 32-bit | RW | All | General-purpose registers. You need to tell tesla/turing the number needed for your particular program; rnum can be up to 128. |
| | $r<rnum>-$r127 | | | | Out-of-bounds GPRs. They usually read as 0 and can be used for that, but do something weird when you try to store them into memory. |
| $r?l | $r0l-$r63l | 16-bit | | | Low 16-bit half of the given $r register, usable in 16-bit insns. |
| $r?h | $r0h-$r63h | | | | High 16-bit half of the given $r register, usable in 16-bit insns. |
| $r?d | $r0d-$r126d, number divisible by 2 | 64-bit | | | Pair consisting of $r<num+1> [high] and $r<num> [low] used as a single 64-bit register. Usable in l[] and g[] load/stores and f64 insns. |
| $r?q | $r0q-$r124q, number divisible by 4 | 128-bit | | | Quad consisting of $r<num+3>, $r<num+2>, $r<num+1>, $r<num> used as a single 128-bit register. Usable in l[] and g[] load/stores. |
| $o? | $o0-$o126 | 32-bit | WO | VP,GP | Output registers. They're write-only. Like $r, you need to tell tesla how many you need. Can be up to 128, but you won't be able to use that last one. |
| | $o127 | | | All | Bit-bucket register. Writes here get ignored, even if you declared output 127 [yes, this makes output 127 useless]. |
| $o?l | $o0l-$o63l | 16-bit | | VP,GP | 16-bit halves of output registers. They don't really work: writing to either half duplicates the value into both halves of the given output. Probably useless. |
| $o?h | $o0h-$o62h | | | | |
| | $o63h | | | All | Bit-bucket register. Writes here get ignored, even if you declared output 63 [so you can use it as a bit-bucket safely without disturbing $o63]. |
| $a? | $a1-$a4 | 16-bit | RW | All | Address registers: can be used for addressing all memory spaces except g[]. These 4 are the ones used by ptxas. |
| | $a0 | | | | Special address register hardcoded to 0. Absolute addressing in many modes is actually addressing with this reg. |
| | $a5-$a6 | | | | These registers also seem to be hardcoded to 0, but for no good reason. Avoid. |
| | $a7 | | | | This register, otoh, seems to work, but isn't used by ptxas for some reason. |
| $c? | $c0-$c3 | 4-bit | | | Condition registers. A lot of instructions can be told to set one of them according to the insn result. They contain 4 different 1-bit flags; these flags and their various combinations can be used for conditional execution. |
| (Special registers) | 0: physid | 32-bit | RO | | Identifies the physical place of the thread on the GPU; see the thread hierarchy info above. |
| | 1: clock | | | | Counts clock ticks. Probably. |
| | 2: ??? | | | | Seems to always be 0. |
| | 3: ??? | | | | Seems to always be 0x20. |
| | 4-7: pm0-pm3 | | | | Performance monitoring registers. Value can be changed directly by the SET_PM[] methods, and you can set them to count some stuff by the PM_MODE methods. Mostly unknown. |
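
As an illustration of the $r pairing rules above, here's a toy C model of one thread's register file (a sketch, nothing hardware-derived) showing how the 64-bit and 16-bit views map onto the 32-bit GPRs:

#include <stdint.h>

/* Toy model of one thread's GPR file: up to 128 32-bit registers. */
typedef struct {
  uint32_t r[128];
} gpr_file;

/* $r<num>d: $r<num+1> is the high word, $r<num> the low word; num must be even. */
uint64_t read_rd(const gpr_file *f, int num) {
  return (uint64_t)f->r[num + 1] << 32 | f->r[num];
}

/* $r<num>l / $r<num>h: low/high 16-bit halves of $r<num>, num 0-63. */
uint16_t read_rl(const gpr_file *f, int num) { return f->r[num] & 0xffff; }
uint16_t read_rh(const gpr_file *f, int num) { return f->r[num] >> 16; }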

Bits in condition registers

| Bit | Name | Description |
| --- | --- | --- |
| 0 | Zero | Set if the result is 0 or NaN. |
| 1 | Sign | Set if the result has the highest bit set [integer] or is negative or NaN [float]. |
| 2 | Carry | Set for integer addition if a carry out of the highest bit happened. |
| 3 | Overflow | Set for integer addition if the high bit of the destination doesn't match the "true" sign of the result. Calculated before saturation if the insn does that, so you can check whether saturation happened. |
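
For example, here's a minimal C sketch (an illustration, not hardware-verified) of how these four flags would come out of a plain 32-bit integer addition:

#include <stdint.h>

/* Compute the 4-bit $c value for a 32-bit integer addition a + b,
   per the bit assignments above (integer case, no saturation). */
unsigned add_flags(uint32_t a, uint32_t b) {
  uint32_t res = a + b;
  unsigned zero = res == 0;                     /* bit 0 */
  unsigned sign = res >> 31;                    /* bit 1 */
  unsigned carry = res < a;                     /* bit 2: unsigned wrap-around */
  unsigned ovf = ((a ^ res) & (b ^ res)) >> 31; /* bit 3: signed overflow */
  return zero | sign << 1 | carry << 2 | ovf << 3;
}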

Memory spaces

| Name | Size | Type | Available in | Description |
| --- | --- | --- | --- | --- |
| c0-c15 | Up to 64kiB each | RO cached VRAM | All | Constant spaces: locally-cached chunks of VRAM assumed to be constant by the card. First access can be slow, subsequent accesses fast. |
| l | Up to 64kiB per thread | RW VRAM | All | Local space: a per-thread chunk of VRAM. Has quite funny address translation applied to get the real address. Just like g[], but with funky address mangling. And just as slow. |
| g0-g15 | Up to 4GB each | RW VRAM | CP | Global spaces: directly-mapped writable and readable chunks of VRAM. Support some atomic ops on >=sm_11. Slow compared to the others. |
| s | Up to 16kB per block | RW on-MP | CP | Shared space: a block of fast memory on the SM itself, shared between threads in a single block. Supports some atomic ops since sm_12. The first 0x10 bytes contain grid layout info, the following space contains parameters passed from the host. |
| a | ??? | RO on-MP??? | VP,GP | Attribute space. Contains attributes/inputs for VP, pointers to primitives in p[] for GP. Probably like s[]. Not much is known. |
| p | ??? | | GP | Primitive space. Contains attributes/inputs for GP. Probably like s[]. Not much is known. |
| v | ??? | | FP | Varying space. Contains interpolated inputs for FP. Looks like there are at least 3 different ones for flat/normal/centroid varyings. Not much is known. |

Addressing modes: for everything but g[], you use [$a+offset] and addresses are 16-bit. For g[] you use g[$r] and addresses are 32-bit.

Stack and local memory

Return addresses for call insn and rejoin points are stored on a stack. The stack is per-warp. A single stack entry is two 32-bit words. Stack entries are grouped into blocks of 4 entries. The MP can hold up to 3 blocks per warp [tested on NV86] inside itself, then it starts spilling the blocks to memory, one block at a time.

Local memory is simply an area of VRAM with some space carved out for each physid.

Addresses in both local and stack memory are mangled together from physid and the actual position. The formulas are non-trivial and are best described by the following C functions (parameterized here over the active-TP/MP masks and the relevant method settings):


#include <stdbool.h>
#include <stdint.h>

// Number of active TPs with an id lower than tpid; tp_mask is the bitmask
// of active TPs (PMC reg 0x1540 & 0xffff).
int compressed_tp_id(uint32_t tp_mask, int tpid) {
  int res = 0;
  for (int i = 0; i < tpid; i++)
    if (tp_mask & (1 << i))
      res++;
  return res;
}

// Same formula, over the per-TP MP mask (PMC reg 0x1540 & 0x0f000000).
int compressed_mp_id(uint32_t mp_mask, int mpid) {
  int res = 0;
  for (int i = 0; i < mpid; i++)
    if (mp_mask & (1 << i))
      res++;
  return res;
}

// warps_log_alloc is STACK_WARPS_LOG_ALLOC / LOCAL_WARPS_LOG_ALLOC;
// the special value 7 really means 5.
int warp_bits(int warps_log_alloc) {
  return warps_log_alloc == 7 ? 5 : warps_log_alloc;
}

// Number of active TPs; pre-NVA0 chipsets round it up to a power of two.
int padded_tp_count(uint32_t tp_mask, bool pre_nva0) {
  int count = __builtin_popcount(tp_mask);
  if (pre_nva0)
    while (count & (count - 1))
      count++;
  return count;
}

// mp_count is the number of MPs per TP.
uint32_t stack_address(int entry_number, int warpid, int mpid, int tpid,
                       uint32_t tp_mask, uint32_t mp_mask, int mp_count,
                       int stack_warps_log_alloc, bool pre_nva0) {
  int wbits = warp_bits(stack_warps_log_alloc);
  uint32_t mpssize = mp_count * 0x20; // 4 entries, 8 bytes each
  uint32_t warpssize = mpssize << wbits;
  uint32_t tpssize = warpssize * padded_tp_count(tp_mask, pre_nva0);
  return (entry_number >> 2) * tpssize
       + compressed_tp_id(tp_mask, tpid) * warpssize
       + (warpid & ((1 << wbits) - 1)) * mpssize
       + compressed_mp_id(mp_mask, mpid) * 0x20
       + (entry_number & 3) * 8;
}

// halfwarps is the WARP_HALVES setting (1 or 2); for CPs it is forced to 1
// if method 0x3b8 was set to 0. TODO: anything like this for tesla?
uint32_t local_address(uint32_t addr, int laneid, int warpid, int mpid, int tpid,
                       uint32_t tp_mask, uint32_t mp_mask, int mp_count,
                       int local_warps_log_alloc, int halfwarps, bool pre_nva0) {
  int wbits = warp_bits(local_warps_log_alloc);
  uint32_t blocksize = 0x100 * halfwarps;
  uint32_t mpssize = mp_count * blocksize;
  uint32_t warpssize = mpssize << wbits;
  uint32_t tpssize = warpssize * padded_tp_count(tp_mask, pre_nva0);
  return (addr >> 4) * tpssize
       + compressed_tp_id(tp_mask, tpid) * warpssize
       + (warpid & ((1 << wbits) - 1)) * mpssize
       + compressed_mp_id(mp_mask, mpid) * blocksize
       + (laneid >> 4) * 0x100
       + ((addr & 0xc) << 4 | (laneid & 0xf) << 2 | (addr & 3));
}

Or put another way:

A block is, for the stack, 4 consecutive entries of a single warp. For local memory, it is a 0x100-byte or 0x200-byte block with the following address format:

  • bits 0-1: bits 0-1 of the l[] address
  • bits 2-5: bits 0-3 of laneid
  • bits 6-7: bits 2-3 of the l[] address
  • bit 8, only if 32 lanes/warp are enabled: bit 4 of laneid

Blocks are further grouped into larger blocks by, in order: MP id, warp id, TP id, and the remaining bits of the address / entry number. Pre-NVA0 cards align each such grouping to a power-of-two size, NVA0+ don't.
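
For illustration, a tiny driver for the local_address() function above (the masks, MP count and method settings below are made-up example values, not measured ones), printing how the first few lanes interleave within a 0x100-byte block:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
  /* Example config: 2 active TPs (mask 0x3), 2 MPs per TP (mask 0x3),
     LOCAL_WARPS_LOG_ALLOC = 7 (meaning 5), 32 lanes/warp, NVA0+. */
  for (int lane = 0; lane < 4; lane++)
    for (uint32_t addr = 0; addr < 8; addr += 4)
      printf("lane %d, l[0x%x] -> 0x%x\n", lane, addr,
             local_address(addr, lane, 0, 0, 0, 0x3, 0x3, 2, 7, 2, false));
  return 0;
}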

Stack & local segments are specified in the following methods:

| tesla | turing | name | what |
| --- | --- | --- | --- |
| 0x194 | 0x1bc | DMA_STACK | stack DMA segment |
| 0x190 | 0x1b8 | DMA_LOCAL | local DMA segment |
| 0xd94 | 0x218 | STACK_ADDRESS_HIGH | stack address, high |
| 0xd98 | 0x21c | STACK_ADDRESS_LOW | stack address, low |
| 0xd9c | 0x220 | STACK_SIZE_LOG | log2(stack_size_in_entries) + 1, or 0 to disable stack |
| 0x12d8 | 0x294 | LOCAL_ADDRESS_HIGH | local address, high |
| 0x12dc | 0x298 | LOCAL_ADDRESS_LOW | local address, low |
| 0x12e0 | 0x29c | LOCAL_SIZE_LOG | log2(local_size_in_bytes/16) + 1, or 0 to disable local |
| 0xf44 | 0x2fc | LOCAL_WARPS_LOG_ALLOC | Number of bits allocated for warp id in l[] addressing. As a special case, 5 is an invalid value here; use 7 if you mean 5. Valid values: 0-4 and 7. |
| 0xf48 | 0x300 | LOCAL_WARPS_NO_CLAMP | If 0, don't execute warps with ids that wouldn't fit in l[] addressing. If 1, execute them anyway, masking off the extra bits and aliasing earlier warps. |
| 0xf4c | 0x304 | STACK_WARPS_LOG_ALLOC | Number of bits allocated for warp id in stack addressing. As a special case, 5 is an invalid value here; use 7 if you mean 5. Valid values: 0-4 and 7. |
| 0xf50 | 0x308 | STACK_WARPS_NO_CLAMP | If 0, don't execute warps with ids that wouldn't fit in stack addressing. If 1, execute them anyway, masking off the extra bits and aliasing earlier warps. |

The stack grows up from position 0. It is empty when execution starts.

Format for a single stack entry seems to be:

  • word0 & 0x003fffff: return or rejoin address, shifted right by 2.
  • word0 & 0x00c00000: a copy of bits 2 and 1 of this entry's number. [why?]
  • word0 & 0x1f000000: a copy of the warp id, for some reason.
  • word0 & 0xe0000000: type of entry. Known types: 010 == call without 0x40 in the second word, 011 == call with 0x40 in the second word, 110 == joinat.
  • word1: bitmask of active warps.
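
A minimal C sketch of decoding one such entry according to the bitfields above (the field names are ours, not NVIDIA's):

#include <stdint.h>

/* One stack entry: two 32-bit words, decoded per the bitfields above. */
struct stack_entry {
  uint32_t address;     /* word0 bits 0-21, shifted back left by 2 */
  unsigned entry_bits;  /* word0 bits 22-23: copy of entry number bits 2:1 */
  unsigned warpid;      /* word0 bits 24-28 */
  unsigned type;        /* word0 bits 29-31: 010 call, 011 call+0x40, 110 joinat */
  uint32_t active_mask; /* word1 */
};

struct stack_entry decode_stack_entry(uint32_t word0, uint32_t word1) {
  struct stack_entry e;
  e.address = (word0 & 0x003fffff) << 2;
  e.entry_bits = (word0 >> 22) & 3;
  e.warpid = (word0 >> 24) & 0x1f;
  e.type = word0 >> 29;
  e.active_mask = word1;
  return e;
}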

All joinat does is push an entry onto the stack [it doesn't validate the address or anything].

To check: exact behavior of join, call, and ret. Will ret/join complain about mismatched types? Can I manipulate the stack by mapping the same area with g[] and forcing spilling? That would enable very hackish indirect jumping in CPs…

Const spaces

There are 16 const spaces. Each of them can be independently bound to one of 128 CBs [const buffers]. To set up a const buffer, write its address to CB_DEF_ADDRESS_*, then write its number and size to CB_DEF_SET. CBs share their DMA object with program code. To bind a buffer to a c[] space in a particular type of program, write to SET_PROGRAM_CB. The data in CBs is cached and needs to be explicitly flushed by poking 0 to CODE_CB_FLUSH when you change it externally.

There’s also an upload function, which lets you upload data directly to a CB buffer using tesla/turing, and automatically updates cache [you don’t need to CB_FLUSH]. To use it, just write offset and buffer id to CB_ADDR, then throw data at CB_DATA. It doesn’t matter which CB_DATA you use, they’re all aliases [probably made so you can upload <=64 bytes with a single packet…]. The address behind CB_ADDR is autoincremented for each CB_DATA access.
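
For example, uploading a few words into a CB might look like the sketch below. push_method() is the same hypothetical stand-in for submitting methods as earlier, and the method numbers used are the turing ones from the table that follows:

#include <stdint.h>

void push_method(uint32_t mthd, uint32_t value); /* hypothetical helper */

/* Upload `count` 32-bit words into CB `cb`, starting at word offset `offset`. */
void cb_upload(int cb, uint32_t offset, const uint32_t *data, int count) {
  /* CB_ADDR (turing 0x238): buffer id in bits 0-6, word offset in bits 8-21. */
  push_method(0x238, (offset << 8) | cb);
  /* CB_DATA (turing 0x23c-0x278): each write stores one word and
     autoincrements the upload address; any of the aliases will do. */
  for (int i = 0; i < count; i++)
    push_method(0x23c, data[i]);
}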

The methods:

| tesla | turing | name | description |
| --- | --- | --- | --- |
| 0x198 | 0x1c0 | DMA_CODE_CB | DMA segment for CBs and program code |
| 0xf00 | 0x238 | CB_ADDR | Sets the upload address and buffer for subsequent CB_DATA uploads. &0x7f: buffer id, &0x003fff00: target address, shifted right by 2 [or, in units of 32-bit words]. |
| 0xf04-0xf40 | 0x23c-0x278 | CB_DATA[0-15] | Upload method: anything written to any of these methods is stored at the current upload address, then the upload address is autoincremented by 4. It's an error to upload after the address overflows. |
| 0x1280 | 0x2a4 | CB_DEF_ADDRESS_HIGH | Address of CB, to be used by the next CB_DEF_SET. |
| 0x1284 | 0x2a8 | CB_DEF_ADDRESS_LOW | |
| 0x1288 | 0x2ac | CB_DEF_SET | Sets address and size for a given CB id. Address taken from CB_DEF_ADDRESS_*. &0x7f0000: buffer id, &0xffff: size. Size needs to be a multiple of 0x100 bytes. Setting the size field to 0 is special and really means size 65536. |
| 0x1440 | 0x380 | CODE_CB_FLUSH | Writing 0 flushes on-SM caches of code and CB data. Needed after changing CB contents with anything other than CB_ADDR/CB_DATA. |
| 0x1694 | 0x3c8 | SET_PROGRAM_CB | Binds a CB to a c[] space in a program. &0x7f000: CB id, &0xf00: c[] space number, &0xf0: program type for tesla [0 == VP, 2 == GP, 3 == FP], doesn't apply to turing, &0x1: unknown flag, must be set to 1. |

To check: flushes and that unknown flag.
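
As an illustration of the SET_PROGRAM_CB bitfields, a small helper packing the method value (same hypothetical push_method() as before):

#include <stdint.h>

void push_method(uint32_t mthd, uint32_t value); /* hypothetical helper */

/* Bind CB `cb` (0-127) to c<cspace>[] for tesla program type `progtype`
   (0 == VP, 2 == GP, 3 == FP; the field is ignored by turing). */
void bind_cb(int cb, int cspace, int progtype) {
  uint32_t val = (cb << 12) | (cspace << 8) | (progtype << 4) | 1;
  push_method(0x1694, val); /* SET_PROGRAM_CB on tesla; turing uses 0x3c8 */
}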

CP

CPs are launched using turing. They're launched in grids consisting of blocks consisting of threads. Blocks are independent units of computation; new ones start execution as soon as there are enough free warps to hold them, giving partly-parallel, partly-serial execution. Each block in a grid is identified by its x, y coordinates, and blocks appear to start execution in (y,x) lexicographical order.

A block, in turn, contains threads identified by their x, y, z coordinates. All threads in a block are guaranteed to be executed in parallel and have access to a barrier instruction that stops a thread's execution until all threads in the block have reached it. Each block is also assigned its own s[] memory space that can be accessed by all its threads.

The CP-specific methods are:

| turing | name | description |
| --- | --- | --- |
| 0x0388 | GRIDID | Grid ID: freeform 16-bit value |
| 0x03a4 | GRIDDIM | Grid dimensions: (y << 16) + x, both x and y in the 0-65535 range. |
| 0x03a8 | SHARED_SIZE | Size of shared memory per block. Needs to be in units of 0x40 bytes. |
| 0x03ac | BLOCKDIM_XY | Block dimensions: (y << 16) + x |
| 0x03b0 | BLOCKDIM_Z | Block dimensions: z |
| 0x02b4 | THREADS_PER_BLOCK | Threads per block |
| 0x02b8 | - | Lane enable: accepts 0 and 1. 1 is needed for 32 lanes. No idea why this exists in addition to WARP_HALVES. |
| 0x0374 | USER_PARAM_COUNT | Parameter count, shifted left by 8 bits. Max 64. |
| 0x0600-0x06fc | USER_PARAM[0-63] | Parameters. |
| 0x02f8 | - | Unknown purpose, but you need to put 1 here after setting up all of the above, otherwise LAUNCH (0x368) causes DATA_ERROR. |
| 0x0368 | LAUNCH | Write 0 here to actually launch the grid. |

CPs use 32 lanes if 1 is written to 0x02b8 and 2 to 0x03b8, 16 lanes otherwise.
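
A sketch of a complete launch sequence built from the methods above (push_method() is the same hypothetical stand-in; the GRIDID value is arbitrary, and code, CB, s[] and parameter setup are omitted):

#include <stdint.h>

void push_method(uint32_t mthd, uint32_t value); /* hypothetical helper */

/* Launch a gridx x gridy grid of bx*by*bz-thread blocks, 32 lanes/warp. */
void launch_grid(int gridx, int gridy, int bx, int by, int bz) {
  push_method(0x0388, 0x1234);                /* GRIDID: freeform value */
  push_method(0x03a4, (gridy << 16) + gridx); /* GRIDDIM */
  push_method(0x03ac, (by << 16) + bx);       /* BLOCKDIM_XY */
  push_method(0x03b0, bz);                    /* BLOCKDIM_Z */
  push_method(0x02b4, bx * by * bz);          /* THREADS_PER_BLOCK */
  push_method(0x03b8, 2);                     /* WARP_HALVES: 32 threads/warp */
  push_method(0x02b8, 1);                     /* lane enable, needed for 32 lanes */
  push_method(0x02f8, 1);                     /* unknown, required before LAUNCH */
  push_method(0x0368, 0);                     /* LAUNCH */
}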

For each launched block, all (z,y,x) tuples in the range (0,0,0) through (blockdim.z-1, blockdim.y-1, blockdim.x-1) are created and sorted lexicographically. Then the first THREADS_PER_BLOCK of them are taken and assigned to sequential lanes, spanning multiple warps if needed. It is a DATA_ERROR if you LAUNCH with THREADS_PER_BLOCK > blockdim.x*blockdim.y*blockdim.z. So for blockdim.x == 2, blockdim.y == 3, blockdim.z == 4, THREADS_PER_BLOCK == 21, and 16 enabled lanes, threads in a block are assigned like this:

| tid.x | tid.y | tid.z | warp id | lane id |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | a | 0 |
| 1 | 0 | 0 | a | 1 |
| 0 | 1 | 0 | a | 2 |
| 1 | 1 | 0 | a | 3 |
| 0 | 2 | 0 | a | 4 |
| 1 | 2 | 0 | a | 5 |
| 0 | 0 | 1 | a | 6 |
| 1 | 0 | 1 | a | 7 |
| 0 | 1 | 1 | a | 8 |
| 1 | 1 | 1 | a | 9 |
| 0 | 2 | 1 | a | 0xa |
| 1 | 2 | 1 | a | 0xb |
| 0 | 0 | 2 | a | 0xc |
| 1 | 0 | 2 | a | 0xd |
| 0 | 1 | 2 | a | 0xe |
| 1 | 1 | 2 | a | 0xf |
| 0 | 2 | 2 | b | 0 |
| 1 | 2 | 2 | b | 1 |
| 0 | 0 | 3 | b | 2 |
| 1 | 0 | 3 | b | 3 |
| 0 | 1 | 3 | b | 4 |
| 1 | 1 | 3 | cut off by THREADS_PER_BLOCK | |
| 0 | 2 | 3 | | |
| 1 | 2 | 3 | | |
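
The mapping in the table above boils down to a lexicographic (z,y,x) index split into warp and lane; a C sketch (lanes_per_warp is 16 or 32 depending on the lane setup, and the warp index is relative to the first warp the block got):

/* Linear index of thread (x,y,z) within its block. */
int thread_index(int x, int y, int z, int bx, int by) {
  return (z * by + y) * bx + x;
}

/* Warp (relative to the block's first warp) and lane of thread (x,y,z);
   returns -1 for threads cut off by THREADS_PER_BLOCK. */
int thread_warp(int x, int y, int z, int bx, int by,
                int threads_per_block, int lanes_per_warp) {
  int idx = thread_index(x, y, z, bx, by);
  return idx < threads_per_block ? idx / lanes_per_warp : -1;
}

int thread_lane(int x, int y, int z, int bx, int by,
                int threads_per_block, int lanes_per_warp) {
  int idx = thread_index(x, y, z, bx, by);
  return idx < threads_per_block ? idx % lanes_per_warp : -1;
}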

When your block starts, shared memory for the block contains the following:
| Address | Size | PTX name | Description |
| --- | --- | --- | --- |
| 0x0 | 16-bit | gridid | Taken straight from the GRIDID method. |
| 0x2 | 16-bit | ntid.x | Block dimensions, taken from BLOCKDIM_* |
| 0x4 | 16-bit | ntid.y | |
| 0x6 | 16-bit | ntid.z | |
| 0x8 | 16-bit | nctaid.x | Grid dimensions, taken from GRIDDIM |
| 0xa | 16-bit | nctaid.y | |
| 0xc | 16-bit | ctaid.x | Coordinates of this block inside the grid, 0 through GRIDDIM.[XY]-1 |
| 0xe | 16-bit | ctaid.y | |
| 0x10 | USER_PARAM_COUNT*4 | - | Parameters as specified in USER_PARAM |
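
The same layout as a C struct overlay (just a sketch; the parameter area at 0x10 has whatever layout your kernel expects):

#include <stdint.h>

/* Layout of the start of s[] at CP launch, per the table above. */
struct cp_shared_header {
  uint16_t gridid;    /* 0x0: from the GRIDID method */
  uint16_t ntid_x;    /* 0x2: block dimensions, from BLOCKDIM_* */
  uint16_t ntid_y;    /* 0x4 */
  uint16_t ntid_z;    /* 0x6 */
  uint16_t nctaid_x;  /* 0x8: grid dimensions, from GRIDDIM */
  uint16_t nctaid_y;  /* 0xa */
  uint16_t ctaid_x;   /* 0xc: this block's coordinates within the grid */
  uint16_t ctaid_y;   /* 0xe */
  uint32_t params[];  /* 0x10: USER_PARAM_COUNT 32-bit parameters */
};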

Also, at CP start, $r0 contains coordinates of the current thread inside its block:

  • & 0x0000ffff: tid.x
  • & 0x03ff0000: tid.y
  • & 0xfc000000: tid.z
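
A minimal C sketch of decoding those fields:

#include <stdint.h>

/* Decode the thread coordinates packed into $r0 at CP start. */
void decode_tid(uint32_t r0, int *x, int *y, int *z) {
  *x = r0 & 0xffff;        /* bits 0-15  */
  *y = (r0 >> 16) & 0x3ff; /* bits 16-25 */
  *z = r0 >> 26;           /* bits 26-31 */
}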

Global memory

In CPs, there are 16 segments of global memory available. Each of them is independent, and can be either a linear area of VRAM or a normal 2D surface. In the linear case, the address used in a g[] reference is simply used as an offset relative to GLOBAL_BASE_*.
If you want to use kernels compiled with nvcc, memory accesses go through segment g14.

However, when bound to a 2D surface, addressing is funnier. The high 16 bits of the address passed to g[] are taken as the y coordinate, the low 16 as a byte offset inside a line. Tiling is applied to that according to the specified tiling mode, with GLOBAL_PITCH used as the tile pitch [distance between starts of consecutive rows of tiles]. The resulting offset is then added to GLOBAL_BASE_* and further mangled according to tile flags 0x7000.

Or, if you prefer a formula, for g[REG]:

  • X = REG & 0xffff
  • Y = REG >> 16
  • TILE_SHIFT = TILE_MODE + 2
  • TILE_MASK = (1 << TILE_SHIFT) - 1
  • OFFS = (X & 0x3f) + ((Y & TILE_MASK) << 6) + ((X >> 6) << (6 + TILE_SHIFT))
  • ADDR = apply_tile_flags_0x7000( (GLOBAL_BASE_HIGH[i] << 32) + GLOBAL_BASE_LOW[i] + OFFS + (Y >> TILE_SHIFT) * GLOBAL_PITCH[i] )

The accessible area is limited by the GLOBAL_LIMIT method. For linear memory, it's set to size-1, where size needs to be a multiple of 0x100 bytes. A write is out of bounds if REG > GLOBAL_LIMIT. For 2D surfaces, GLOBAL_LIMIT is a bitfield with separate limits for the x and y parts; access is considered out of bounds if (REG>>16) > (GLOBAL_LIMIT>>16) or (REG&0xffff) > (GLOBAL_LIMIT&0xffff).
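
The same addressing as a C sketch; the final mangling with tile flags 0x7000 isn't understood, so it's left as a stub, and base/pitch/limit/tile_mode come from the GLOBAL_* methods listed below:

#include <stdbool.h>
#include <stdint.h>

/* Placeholder for the not-understood mangling with tile flags 0x7000. */
uint64_t apply_tile_flags_0x7000(uint64_t addr);

/* Translate a g[reg] reference in a 2D-surface-mode space, per the formula above. */
uint64_t global_2d_address(uint32_t reg, uint64_t base, uint32_t pitch,
                           int tile_mode, uint32_t limit, bool *oob) {
  uint32_t x = reg & 0xffff;
  uint32_t y = reg >> 16;
  int tile_shift = tile_mode + 2;
  uint32_t tile_mask = (1 << tile_shift) - 1;
  *oob = y > (limit >> 16) || x > (limit & 0xffff);
  uint32_t offs = (x & 0x3f) + ((y & tile_mask) << 6)
                + ((x >> 6) << (6 + tile_shift));
  return apply_tile_flags_0x7000(base + offs + (uint64_t)(y >> tile_shift) * pitch);
}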

So, methods [i is index of gi[] space you’re setting up]:

| method | name | description |
| --- | --- | --- |
| 0x1a0 | DMA_GLOBAL | DMA segment used for all g[] spaces |
| 0x400+(i<<5) | GLOBAL_BASE_HIGH | The base address of the gi[] segment |
| 0x404+(i<<5) | GLOBAL_BASE_LOW | |
| 0x408+(i<<5) | GLOBAL_PITCH | Surface pitch for 2D surfaces, ignored for linear. Must be a multiple of 0x100, max 0x800000. |
| 0x40c+(i<<5) | GLOBAL_LIMIT | Highest allowed address. Needs to be (a multiple of 0x100)-1. Has separate x and y parts for 2D surfaces. |
| 0x410+(i<<5) | GLOBAL_MODE | Bit 0: 0 == 2D surface, 1 == linear buffer. Bits 8-10: tile mode if 2D surface. |