This work is licensed under the Creative Commons Attribution
3.0 Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/3.0/ or send a letter
to Creative Commons, 171 Second Street, Suite 300, San Francisco,
California, 94105, USA.
Instructions
0. Conventions
1. MOV
2. Special regs
2.1. MOV from $c
2.2. MOV to $c
2.3. MOV from $a
2.4. SHL to $a
2.5. ADD from $a to $a
2.6. MOV from sreg
3. Integer instructions
3.1. Integer ADD family
3.2. Integer short MUL
3.3. Integer 24-bit MUL
3.4. Integer MUL-ADD
3.5. Integer SAD
3.6. Integer MIN/MAX
3.7. Integer SET
4. Bit instructions
4.1. Bit operations
4.2. Bit shifts
5. TBD
0. Conventions
S(x): 31st bit of x for 32-bit x, 15th for 16-bit x [ie. the sign bit].
SEX(x): sign-extension of x
ZEX(x): zero-extension of x
1. Normal MOV
[lanemask] mov b32/b16 DST SRC
lanemask assumed 0xf for short and immediate versions.
if (lanemask & 1 << (laneid & 3)) DST = SRC;
Short: 0x10000000 base opcode
0x00008000 0: b16, 1: b32
operands: S*DST, S*SRC1/S*SHARED
Imm: 0x10000000 base opcode
0x00008000 0: b16, 1: b32
operands: L*DST, IMM
Long: 0x10000000 0x00000000 base opcode
0x00000000 0x04000000 0: b16, 1: b32
0x00000000 0x0003c000 lanemask
operands: LL*DST, L*SRC1/L*SHARED
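As a rough C sketch of the lanemask test above: the function names are made up for illustration, and the field position is read off the 0x0003c000 mask of the long encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a mov with this 4-bit lanemask writes DST in thread `laneid`.
 * Only the low two bits of laneid are used, so the mask repeats every
 * four lanes. */
static bool mov_lane_enabled(uint32_t lanemask, uint32_t laneid)
{
    return (lanemask & (1u << (laneid & 3))) != 0;
}

/* Extracts the lanemask field from the second word of a long mov. */
static uint32_t long_mov_lanemask(uint32_t word1)
{
    return (word1 & 0x0003c000u) >> 14;
}
```

With lanemask 0xf [the value assumed for the short and immediate forms] every lane passes the test.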
2.1. MOV from $c
mov DST COND
DST is 32-bit $r.
DST = COND;
Long: 0x00000000 0x20000000 base opcode
operands: LDST, COND
2.2. MOV to $c
mov CDST SRC
SRC is 32-bit $r. Yes, the 0x40 $c write enable flag in second word is
actually ignored.
CDST = SRC;
Long: 0x00000000 0xa0000000 base opcode
operands: CDST, LSRC1
2.3. MOV from $a
mov DST AREG
DST is 32-bit $r. Setting the flag normally used for autoincrement mode doesn't
work, but still causes a crash when using non-writable $a's.
DST = AREG;
Long: 0x00000000 0x40000000 base opcode
0x02000000 0x00000000 crashy flag
operands: LDST, AREG
2.4. SHL to $a
shl ADST SRC SHCNT
SRC is 32-bit $r.
ADST = SRC << SHCNT;
Long: 0x00000000 0xc0000000 base opcode
operands: ADST, LSRC1/LSHARED, HSHCNT
2.5. ADD from $a to $a
add ADST AREG OFFS
Like mov from $a, setting the flag normally used for autoincrement mode doesn't
work, but still causes a crash when using non-writable $a's.
ADST = AREG + OFFS;
Long: 0xd0000000 0x20000000 base opcode
0x02000000 0x00000000 crashy flag
operands: ADST, AREG, OFFS
2.6. MOV from sreg
mov DST physid S=0
mov DST clock S=1
mov DST sreg2 S=2
mov DST sreg3 S=3
mov DST pm0 S=4
mov DST pm1 S=5
mov DST pm2 S=6
mov DST pm3 S=7
DST is 32-bit $r.
DST = SREG;
Long: 0x00000000 0x60000000 base opcode
0x00000000 0x0001c000 S
operands: LDST
3.1. Integer ADD family
add [sat] b32/b16 [CDST] DST SRC1 SRC2 O2=0, O1=0
sub [sat] b32/b16 [CDST] DST SRC1 SRC2 O2=0, O1=1
subr [sat] b32/b16 [CDST] DST SRC1 SRC2 O2=1, O1=0
addc [sat] b32/b16 [CDST] DST SRC1 SRC2 COND O2=1, O1=1
All operands are 32-bit or 16-bit according to size specifier.
b16/b32 s1, s2;
bool c;
switch (OP) {
case add: s1 = SRC1, s2 = SRC2, c = 0; break;
case sub: s1 = SRC1, s2 = ~SRC2, c = 1; break;
case subr: s1 = ~SRC1, s2 = SRC2, c = 1; break;
case addc: s1 = SRC1, s2 = SRC2, c = COND.C; break;
}
res = s1+s2+c; // infinite precision
CDST.C = res >> (b32 ? 32 : 16);
res = res & (b32 ? 0xffffffff : 0xffff);
CDST.O = (S(s1) == S(s2)) && (S(s1) != S(res));
if (sat && CDST.O)
if (S(res)) res = (b32 ? 0x7fffffff : 0x7fff);
else res = (b32 ? 0x80000000 : 0x8000);
CDST.S = S(res);
CDST.Z = res == 0;
DST = res;
Short/imm: 0x20000000 base opcode
0x10000000 O2 bit
0x00400000 O1 bit
0x00008000 0: b16, 1: b32
0x00000100 sat flag
operands: S*DST, S*SRC1/S*SHARED, S*SRC2/S*CONST/IMM, $c0
Long: 0x20000000 0x00000000 base opcode
0x10000000 0x00000000 O2 bit
0x00400000 0x00000000 O1 bit
0x00000000 0x04000000 0: b16, 1: b32
0x00000000 0x08000000 sat flag
operands: MCDST, LL*DST, L*SRC1/L*SHARED, L*SRC3/L*CONST3, COND
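The pseudocode above can be turned into plain C for the b32 case roughly as follows. This is only a sketch: the type and function names are invented here, and a 64-bit intermediate stands in for "infinite precision".

```c
#include <stdbool.h>
#include <stdint.h>

enum add_op { OP_ADD, OP_SUB, OP_SUBR, OP_ADDC };

struct add_result {
    uint32_t res;
    bool o, c, s, z;    /* the four $c flags */
};

static struct add_result int_add32(enum add_op op, bool sat,
                                   uint32_t src1, uint32_t src2, bool cond_c)
{
    uint32_t s1 = src1, s2 = src2;
    bool carry_in = false;
    switch (op) {
    case OP_ADD:  break;
    case OP_SUB:  s2 = ~src2; carry_in = true; break;
    case OP_SUBR: s1 = ~src1; carry_in = true; break;
    case OP_ADDC: carry_in = cond_c; break;
    }
    /* 33 bits are enough to hold s1 + s2 + carry_in exactly. */
    uint64_t wide = (uint64_t)s1 + s2 + carry_in;
    struct add_result r;
    r.res = (uint32_t)wide;
    r.c = (wide >> 32) & 1;
    /* Overflow: both inputs have the same sign, the result a different one. */
    r.o = (s1 >> 31) == (s2 >> 31) && (s1 >> 31) != (r.res >> 31);
    if (sat && r.o)
        r.res = (r.res >> 31) ? 0x7fffffffu : 0x80000000u;
    r.s = r.res >> 31;
    r.z = r.res == 0;
    return r;
}
```

For example, int_add32(OP_ADD, true, 0x7fffffff, 1, false) overflows and saturates to 0x7fffffff, while the same call without sat wraps to 0x80000000.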
3.2. Integer short MUL
mul [CDST] DST u16/s16 SRC1 u16/s16 SRC2
DST is 32-bit, SRC1 and SRC2 are 16-bit.
b32 s1, s2;
if (src1_signed)
s1 = SEX(SRC1);
else
s1 = ZEX(SRC1);
if (src2_signed)
s2 = SEX(SRC2);
else
s2 = ZEX(SRC2);
b32 res = s1*s2; // modulo 2^32
CDST.O = 0;
CDST.C = 0;
CDST.S = S(res);
CDST.Z = res == 0;
DST = res;
Short/imm: 0x40000000 base opcode
0x00008000 src1 is signed
0x00000100 src2 is signed
operands: SDST, SHSRC/SHSHARED, SHSRC2/SHCONST/IMM
Long: 0x40000000 0x00000000 base opcode
0x00000000 0x00008000 src1 is signed
0x00000000 0x00004000 src2 is signed
operands: MCDST, LLDST, LHSRC1/LHSHARED, LHSRC2/LHCONST2
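A minimal C sketch of the 16x16 multiply, assuming nothing beyond the pseudocode above. The helper names are hypothetical, and the flags are left out since they follow directly from the result.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sign-extends a 16-bit value to 32 bits without relying on
 * implementation-defined signed conversions. */
static uint32_t sex16(uint16_t x)
{
    return ((uint32_t)x ^ 0x8000u) - 0x8000u;
}

static uint32_t mul16(uint16_t src1, bool src1_signed,
                      uint16_t src2, bool src2_signed)
{
    uint32_t s1 = src1_signed ? sex16(src1) : src1;
    uint32_t s2 = src2_signed ? sex16(src2) : src2;
    return s1 * s2;    /* unsigned multiply wraps modulo 2^32 */
}
```

Note that signedness is chosen per source, so mixed products like s16 * u16 are expressible.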
3.3. Integer 24-bit MUL
mul [CDST] DST [high] u24/s24 SRC1 SRC2
All operands are 32-bit.
b48 s1, s2;
if (signed) {
s1 = SEX((b24)SRC1);
s2 = SEX((b24)SRC2);
} else {
s1 = ZEX((b24)SRC1);
s2 = ZEX((b24)SRC2);
}
b48 m = s1*s2; // modulo 2^48
b32 res = (high ? m >> 16 : m & 0xffffffff);
CDST.O = 0;
CDST.C = 0;
CDST.S = S(res);
CDST.Z = res == 0;
DST = res;
Short/imm: 0x40000000 base opcode
0x00008000 src are signed
0x00000100 high
operands: SDST, SSRC/SSHARED, SSRC2/SCONST/IMM
Long: 0x40000000 0x00000000 base opcode
0x00000000 0x00008000 src are signed
0x00000000 0x00004000 high
operands: MCDST, LLDST, LSRC1/LSHARED, LSRC2/LCONST2
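The 24-bit multiply can be sketched in C like this, with a 64-bit integer masked down to 48 bits playing the role of b48. The function name is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t mul24(uint32_t src1, uint32_t src2, bool is_signed, bool high)
{
    /* Only the low 24 bits of each source take part. */
    uint64_t s1 = src1 & 0xffffffu;
    uint64_t s2 = src2 & 0xffffffu;
    if (is_signed) {
        /* Sign-extend from bit 23; wraps modulo 2^64, which is fine here. */
        s1 = (s1 ^ 0x800000u) - 0x800000u;
        s2 = (s2 ^ 0x800000u) - 0x800000u;
    }
    uint64_t m = (s1 * s2) & 0xffffffffffffull;    /* modulo 2^48 */
    return high ? (uint32_t)(m >> 16) : (uint32_t)m;
}
```

The high variant returns bits 16-47 of the product, so the two results overlap in bits 16-31.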
3.4. Integer MUL-ADD
addop [CDST] DST mul u16 SRC1 SRC2 SRC3 O1=0 O2=000 S2=0 S1=0
addop [CDST] DST mul s16 SRC1 SRC2 SRC3 O1=0 O2=001 S2=0 S1=1
addop sat [CDST] DST mul s16 SRC1 SRC2 SRC3 O1=0 O2=010 S2=1 S1=0
addop [CDST] DST mul u24 SRC1 SRC2 SRC3 O1=0 O2=011 S2=1 S1=1
addop [CDST] DST mul s24 SRC1 SRC2 SRC3 O1=0 O2=100
addop sat [CDST] DST mul s24 SRC1 SRC2 SRC3 O1=0 O2=101
addop [CDST] DST mul high u24 SRC1 SRC2 SRC3 O1=0 O2=110
addop [CDST] DST mul high s24 SRC1 SRC2 SRC3 O1=0 O2=111
addop sat [CDST] DST mul high s24 SRC1 SRC2 SRC3 O1=1 O2=000
addop is one of:
add O3=00 S4=0 S3=0
sub O3=01 S4=0 S3=1
subr O3=10 S4=1 S3=0
addc O3=11 S4=1 S3=1
If addop is addc, insn also takes an additional COND parameter. DST and
SRC3 are always 32-bit, SRC1 and SRC2 are 16-bit for u16/s16 variants,
32-bit for u24/s24 variants. Only a few of the variants are encodable as
short/immediate, and they're restricted to DST=SRC3.
if (u24 || s24) {
b48 s1, s2;
if (s24) {
s1 = SEX((b24)SRC1);
s2 = SEX((b24)SRC2);
} else {
s1 = ZEX((b24)SRC1);
s2 = ZEX((b24)SRC2);
}
b48 m = s1*s2; // modulo 2^48
b32 mres = (high ? m >> 16 : m & 0xffffffff);
} else {
b32 s1, s2;
if (s16) {
s1 = SEX(SRC1);
s2 = SEX(SRC2);
} else {
s1 = ZEX(SRC1);
s2 = ZEX(SRC2);
}
b32 mres = s1*s2; // modulo 2^32
}
b32 s1, s2;
bool c;
switch (OP) {
case add: s1 = mres, s2 = SRC3, c = 0; break;
case sub: s1 = mres, s2 = ~SRC3, c = 1; break;
case subr: s1 = ~mres, s2 = SRC3, c = 1; break;
case addc: s1 = mres, s2 = SRC3, c = COND.C; break;
}
res = s1+s2+c; // infinite precision
CDST.C = res >> 32;
res = res & 0xffffffff;
CDST.O = (S(s1) == S(s2)) && (S(s1) != S(res));
if (sat && CDST.O)
if (S(res)) res = 0x7fffffff;
else res = 0x80000000;
CDST.S = S(res);
CDST.Z = res == 0;
DST = res;
Short/imm: 0x60000000 base opcode
0x00000100 S1
0x00008000 S2
0x00400000 S3
0x10000000 S4
operands: SDST, S*SRC/S*SHARED, S*SRC2/S*CONST/IMM, SDST, $c0
Long: 0x60000000 0x00000000 base opcode
0x10000000 0x00000000 O1
0x00000000 0xe0000000 O2
0x00000000 0x0c000000 O3
operands: MCDST, LLDST, L*SRC1/L*SHARED, L*SRC2/L*CONST2, L*SRC3/L*CONST3, COND
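To make the long-form variant table easier to follow, here is a sketch of pulling O1, O2 and O3 out of the two instruction words. The enum and function names are invented for this example; O1:O2 combinations not listed in the table above are reported as unknown rather than guessed.

```c
#include <stdint.h>

enum muladd_variant {
    MULADD_UNKNOWN = -1,
    MULADD_U16, MULADD_S16, MULADD_SAT_S16,
    MULADD_U24, MULADD_S24, MULADD_SAT_S24,
    MULADD_HIGH_U24, MULADD_HIGH_S24, MULADD_SAT_HIGH_S24
};

/* O1 is bit 28 of the first word, O2 is bits 29-31 of the second. */
static enum muladd_variant muladd_variant(uint32_t word0, uint32_t word1)
{
    unsigned o1 = (word0 >> 28) & 1;
    unsigned o2 = (word1 >> 29) & 7;
    unsigned idx = (o1 << 3) | o2;
    return idx <= 8 ? (enum muladd_variant)idx : MULADD_UNKNOWN;
}

/* O3 is bits 26-27 of the second word: 0 add, 1 sub, 2 subr, 3 addc. */
static unsigned muladd_addop(uint32_t word1)
{
    return (word1 >> 26) & 3;
}
```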
3.5. Integer SAD
sad [CDST] DST u16/s16/u32/s32 SRC1 SRC2 SRC3
Short variant is restricted to DST same as SRC3. All operands are 32-bit or
16-bit according to size specifier.
int s1, s2; // infinite precision
if (signed) {
s1 = SEX(SRC1);
s2 = SEX(SRC2);
} else {
s1 = ZEX(SRC1);
s2 = ZEX(SRC2);
}
b32 mres = abs(s1-s2); // modulo 2^32
res = mres+SRC3; // infinite precision
CDST.C = res >> (b32 ? 32 : 16);
res = res & (b32 ? 0xffffffff : 0xffff);
CDST.O = (S(mres) == S(SRC3)) && (S(mres) != S(res));
CDST.S = S(res);
CDST.Z = res == 0;
DST = res;
Short: 0x50000000 base opcode
0x00008000 0: b16 1: b32
0x00000100 src are signed
operands: DST, SDST, S*SRC/S*SHARED, S*SRC2/S*CONST, SDST
Long: 0x50000000 0x00000000 base opcode
0x00000000 0x04000000 0: b16, 1: b32
0x00000000 0x08000000 src are signed
operands: MCDST, LLDST, L*SRC1/L*SHARED, L*SRC2/L*CONST2, L*SRC3/L*CONST3
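A small C sketch of the 32-bit SAD result, with flags omitted. Function names are hypothetical; the third operand is taken to be SRC3, and a 64-bit signed integer is wide enough to hold the extended sources and their difference exactly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Zero- or sign-extends a 32-bit source into a 64-bit signed value. */
static int64_t extend32(uint32_t x, bool is_signed)
{
    if (is_signed)
        return (int64_t)(x ^ 0x80000000u) - INT64_C(0x80000000);
    return (int64_t)x;
}

static uint32_t sad32(uint32_t src1, uint32_t src2, uint32_t src3,
                      bool is_signed)
{
    int64_t d = extend32(src1, is_signed) - extend32(src2, is_signed);
    uint32_t mres = (uint32_t)(d < 0 ? -d : d);    /* |d| fits in 32 bits */
    return mres + src3;                            /* wraps modulo 2^32 */
}
```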
3.6. Integer MIN/MAX
min u16/u32/s16/s32 [CDST] DST SRC1 SRC2
max u16/u32/s16/s32 [CDST] DST SRC1 SRC2
All operands are 32-bit or 16-bit according to size specifier.
if (SRC1 < SRC2) { // signed comparison for s16/s32, unsigned for u16/u32.
res = (min ? SRC1 : SRC2);
} else {
res = (min ? SRC2 : SRC1);
}
CDST.O = 0;
CDST.C = 0;
CDST.S = S(res);
CDST.Z = res == 0;
DST = res;
Long: 0x30000000 0x80000000 base opcode
0x00000000 0x20000000 0: max, 1: min
0x00000000 0x08000000 0: u16/u32, 1: s16/s32
0x00000000 0x04000000 0: b16, 1: b32
operands: MCDST, LL*DST, L*SRC1/L*SHARED, L*SRC2/L*CONST2
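The signed/unsigned distinction only affects the comparison, which a C sketch of the 32-bit case makes explicit. The function name is made up; flipping the sign bit is just one convenient way to get a signed ordering out of unsigned compares.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t int_minmax32(bool is_min, bool is_signed,
                             uint32_t src1, uint32_t src2)
{
    /* XOR with the sign bit maps signed order onto unsigned order. */
    uint32_t bias = is_signed ? 0x80000000u : 0;
    bool less = (src1 ^ bias) < (src2 ^ bias);
    if (less)
        return is_min ? src1 : src2;
    return is_min ? src2 : src1;
}
```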
3.7. Integer SET
set [CDST] DST cond u16/s16/u32/s32 SRC1 SRC2
cond can be any subset of {l, g, e}.
All operands are 32-bit or 16-bit according to size specifier.
int s1, s2; // infinite precision
if (signed) {
s1 = SEX(SRC1);
s2 = SEX(SRC2);
} else {
s1 = ZEX(SRC1);
s2 = ZEX(SRC2);
}
bool c;
if (s1 < s2)
c = cond.l;
else if (s1 == s2)
c = cond.e;
else /* s1 > s2 */
c = cond.g;
if (c) {
res = (b32?0xffffffff:0xffff);
} else {
res = 0;
}
CDST.O = 0;
CDST.C = 0;
CDST.S = S(res);
CDST.Z = res == 0;
DST = res;
Long: 0x30000000 0x60000000 base opcode
0x00000000 0x08000000 0: u16/u32, 1: s16/s32
0x00000000 0x04000000 0: b16, 1: b32
0x00000000 0x00010000 cond.g
0x00000000 0x00008000 cond.e
0x00000000 0x00004000 cond.l
operands: MCDST, LL*DST, L*SRC1/L*SHARED, L*SRC2/L*CONST2
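Since cond is a 3-bit subset of {l, e, g}, the whole instruction reduces to a mask test. A sketch for the 32-bit case, with invented names; the COND_* values are chosen to match the order of the cond bits in the long encoding, so the field can be used directly.

```c
#include <stdbool.h>
#include <stdint.h>

enum { COND_L = 1, COND_E = 2, COND_G = 4 };

static uint32_t int_set32(unsigned cond, bool is_signed,
                          uint32_t src1, uint32_t src2)
{
    /* XOR with the sign bit turns a signed compare into an unsigned one. */
    uint32_t bias = is_signed ? 0x80000000u : 0;
    uint32_t a = src1 ^ bias, b = src2 ^ bias;
    unsigned rel = a < b ? COND_L : (a == b ? COND_E : COND_G);
    return (cond & rel) ? 0xffffffffu : 0;
}

/* Extracts the cond field (0x0001c000) from the second word. */
static unsigned long_set_cond(uint32_t word1)
{
    return (word1 >> 14) & 7;
}
```

So "less or equal" is COND_L | COND_E, "not equal" is COND_L | COND_G, the empty set always gives 0 and the full set always gives all ones.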
4.1. Bit operations
and b32/b16 [CDST] DST [not] SRC1 [not] SRC2 O2=0, O1=0
or b32/b16 [CDST] DST [not] SRC1 [not] SRC2 O2=0, O1=1
xor b32/b16 [CDST] DST [not] SRC1 [not] SRC2 O2=1, O1=0
mov2 b32/b16 [CDST] DST [not] SRC1 [not] SRC2 O2=1, O1=1
Immediate forms only allow 32-bit operands, and cannot negate the second operand.
s1 = (not1 ? ~SRC1 : SRC1);
s2 = (not2 ? ~SRC2 : SRC2);
switch (OP) {
case and: res = s1 & s2; break;
case or: res = s1 | s2; break;
case xor: res = s1 ^ s2; break;
case mov2: res = s2; break;
}
CDST.O = 0;
CDST.C = 0;
CDST.S = S(res);
CDST.Z = res == 0;
DST = res;
Imm: 0xd0000000 base opcode
0x00400000 not1
0x00008000 O2 bit
0x00000100 O1 bit
operands: SDST, SSRC/SSHARED, IMM
assumed: not2=0 and b32.
Long: 0xd0000000 0x00000000 base opcode
0x00000000 0x04000000 0: b16, 1: b32
0x00000000 0x00020000 not2
0x00000000 0x00010000 not1
0x00000000 0x00008000 O2 bit
0x00000000 0x00004000 O1 bit
operands: MCDST, LL*DST, L*SRC1/L*SHARED, L*SRC2/L*CONST2
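A C sketch of the 32-bit case. Names are hypothetical; the enum values are picked to equal the two-bit O2:O1 field.

```c
#include <stdbool.h>
#include <stdint.h>

enum bit_op { BIT_AND = 0, BIT_OR = 1, BIT_XOR = 2, BIT_MOV2 = 3 };

static uint32_t bit_op32(enum bit_op op, bool not1, bool not2,
                         uint32_t src1, uint32_t src2)
{
    uint32_t s1 = not1 ? ~src1 : src1;
    uint32_t s2 = not2 ? ~src2 : src2;
    switch (op) {
    case BIT_AND:  return s1 & s2;
    case BIT_OR:   return s1 | s2;
    case BIT_XOR:  return s1 ^ s2;
    case BIT_MOV2: return s2;    /* SRC1 is ignored */
    }
    return 0;    /* not reached for valid op values */
}
```

The not flags give and-not, or-not and friends for free, and mov2 with not2 set is a plain bitwise NOT.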
4.2. Bit shifts
shl b16/b32 [CDST] DST SRC1 SRC2
shl b16/b32 [CDST] DST SRC1 SHCNT
shr u16/u32 [CDST] DST SRC1 SRC2
shr u16/u32 [CDST] DST SRC1 SHCNT
shr s16/s32 [CDST] DST SRC1 SRC2
shr s16/s32 [CDST] DST SRC1 SHCNT
All operands are 16/32-bit according to size specifier, except SHCNT. Shift
counts are always treated as unsigned; passing a negative value to shl
doesn't get you a shr.
int size = (b32 ? 32 : 16);
if (shl) {
res = SRC1 << SRC2; // infinite precision, shift count doesn't wrap.
if (SRC2 < size) { // yes, <. So if you shift 1 left by 32 bits, you DON'T get CDST.C set. but shift 2 left by 31 bits, and it gets set just fine.
CDST.C = (res >> size) & 1; // basically, the bit that got shifted out.
} else {
CDST.C = 0;
}
res = res & (b32 ? 0xffffffff : 0xffff);
} else {
res = SRC1 >> SRC2; // infinite precision, shift count doesn't wrap.
if (signed && S(SRC1)) {
if (SRC2 < size)
res |= (1<<size)-(1<<(size-SRC2)); // fill out the upper bits with 1's.
else
res |= (1<<size)-1;
}
if (SRC2 < size && SRC2 > 0) {
CDST.C = (SRC1 >> (SRC2-1)) & 1;
} else {
CDST.C = 0;
}
}
if (SRC2 == 1) {
CDST.O = (S(SRC1) != S(res));
} else {
CDST.O = 0;
}
CDST.S = S(res);
CDST.Z = res == 0;
DST = res;
Long: 0x30000000 0xc0000000 base opcode
0x00000000 0x20000000 0: shl, 1: shr
0x00000000 0x08000000 0: u16/u32, 1: s16/s32 [shr only]
0x00000000 0x04000000 0: b16, 1: b32
0x00000000 0x00010000 0: use SRC2, 1: use SHCNT
operands: MCDST, LL*DST, L*SRC1/L*SHARED, L*SRC2/L*CONST2/SHCNT
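The carry behaviour is the least obvious part, so here is a C sketch of the 32-bit shifts that returns the result together with CDST.C. Names are invented for illustration, and the O, S and Z flags are left out.

```c
#include <stdbool.h>
#include <stdint.h>

struct shift_result {
    uint32_t res;
    bool c;
};

static struct shift_result shl32(uint32_t src1, uint32_t count)
{
    struct shift_result r = { 0, false };
    if (count < 32) {
        uint64_t wide = (uint64_t)src1 << count;
        r.res = (uint32_t)wide;
        r.c = (wide >> 32) & 1;    /* bit just above the result */
    }
    return r;    /* count >= 32: everything shifted out, carry stays 0 */
}

static struct shift_result shr32(uint32_t src1, uint32_t count, bool is_signed)
{
    bool neg = is_signed && (src1 >> 31);
    struct shift_result r = { neg ? 0xffffffffu : 0, false };
    if (count < 32) {
        r.res = src1 >> count;
        if (neg && count > 0)
            r.res |= ~(0xffffffffu >> count);    /* fill upper bits with 1's */
        r.c = count > 0 && ((src1 >> (count - 1)) & 1);
    }
    return r;
}
```

shl32(1, 32) yields carry 0 while shl32(2, 31) yields carry 1, matching the quirk noted in the pseudocode.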
5. TBD
interp [cent] [flat] DST v[] [SRC]
Gets interpolated FP input, optionally multiplying by a given value.
rcp f32 DST SRC
rsqrt f32 DST SRC
lg2 f32 DST SRC
sin f32 DST SRC
cos f32 DST SRC
ex2 f32 DST SRC
Computes a transcendental function of the argument. rcp is 1/x, rsqrt is
1/sqrt(x). sin, cos, ex2 need arguments preprocessed by the appropriate pre
insn. rcp, rsqrt, lg2 take a float argument directly.
presin f32 DST SRC
preex2 f32 DST SRC
Preprocesses a float argument for use in subsequent sin/cos or ex2
operation, respectively.
mov lock CDST DST s[]
Tries to lock a word of s[] memory and load a word from it. CDST tells
you whether it was successfully locked+loaded or not. A successfully locked
word can't be locked by any other thread until it is unlocked.
mov unlock s[] SRC
Stores a word to previously-locked s[] word and unlocks it.
PREDICATE vote any/all CDST
This instruction doesn't use the predicate field for conditional execution,
abusing it instead as an input argument. vote any sets CDST to true iff the
input predicate evaluated to true in any of the warp's active threads.
vote all sets it to true iff the predicate evaluated to true in all active
threads of the current warp.
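In C terms, treating the warp as a pair of bitmasks, the two votes look roughly like this. The bitmask representation and the 32-bit width are assumptions of this sketch, not something the hardware exposes in this form.

```c
#include <stdbool.h>
#include <stdint.h>

/* `active` has a bit set for each active thread in the warp,
 * `pred` has a bit set for each thread whose predicate is true. */
static bool vote_any(uint32_t active, uint32_t pred)
{
    return (active & pred) != 0;
}

static bool vote_all(uint32_t active, uint32_t pred)
{
    /* No active thread may have a false predicate. */
    return (active & ~pred) == 0;
}
```

Predicate bits of inactive threads are ignored by both. With an empty active mask this sketch reports all as true and any as false, which is just a property of the model.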
set [CDST] DST <cmpop> f32/f64 SRC1 SRC2
Does given comparison operation on SRC1 and SRC2. DST is set to 0xffffffff
if comparison evaluates true, 0 if it evaluates false. If used, CDST.SZ are
set according to DST.
min f32/f64 DST SRC1 SRC2
max f32/f64 DST SRC1 SRC2
Sets DST to the smaller/larger of the two SRC operands. If one operand is NaN,
DST is set to the non-NaN operand. If both are NaN, DST is set to NaN.
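The NaN rule can be written out in C as follows. This is a sketch for f32 with made-up function names; the ordering of -0 and +0 is not covered by the description above and is left to whatever the C comparison gives.

```c
#include <math.h>

static float f32_min(float a, float b)
{
    if (isnan(a))
        return b;    /* b may itself be NaN, giving NaN for two NaNs */
    if (isnan(b))
        return a;
    return a < b ? a : b;
}

static float f32_max(float a, float b)
{
    if (isnan(a))
        return b;
    if (isnan(b))
        return a;
    return a > b ? a : b;
}
```

This is the same NaN handling as the C library's fminf and fmaxf.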
cvt <integer dst> <integer src>
cvt <integer rounding modifier> <integer dst> <float src>
cvt <rounding modifier> <float dst> <integer src>
cvt <rounding modifier> <float dst> <float src>
cvt <integer rounding modifier> <float dst> <float src>
Converts between formats. For integer destinations, always clamps result
to target type range.
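To illustrate the clamping for integer destinations, here is a sketch of two integer-to-integer conversions. The particular type pairs and function names are picked for the example and are not a list of what the hardware supports.

```c
#include <stdint.h>

/* s32 -> u16: negative values clamp to 0, large ones to 0xffff. */
static uint16_t cvt_s32_to_u16(int32_t x)
{
    if (x < 0)
        return 0;
    if (x > 0xffff)
        return 0xffff;
    return (uint16_t)x;
}

/* s32 -> s16: clamps to [-32768, 32767] instead of truncating. */
static int16_t cvt_s32_to_s16(int32_t x)
{
    if (x < INT16_MIN)
        return INT16_MIN;
    if (x > INT16_MAX)
        return INT16_MAX;
    return (int16_t)x;
}
```

Contrast this with a C cast, which would wrap 70000 to 4464 for u16 rather than clamp it to 65535.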
add [sat] rn/rz f32 DST SRC1 SRC2
Adds two floating point numbers together.
mul [sat] rn/rz f32 DST SRC1 SRC2
Multiplies two floating point numbers together.
slct b32 DST SRC1 SRC2 f32 SRC3
Sets DST to SRC1 if SRC3 is positive or 0, to SRC2 if SRC3 negative or NaN.
quadop f32 <op1> <op2> <op3> <op4> DST <srclane> SRC1 SRC2
Intra-quad information exchange instruction. Mad as a hatter.
First, SRC1 is taken from the given lane in current quad. Then
op<currentlanenumber> is executed on it and SRC2, results get
written to DST. ops can be add [SRC1+SRC2], sub [SRC1-SRC2],
subr [SRC2-SRC1], mov2 [SRC2]. srclane can be at least l0, l1,
l2, l3, and these work everywhere. If you're running in FP, looks
like you can also use dox [use current lane number ^ 1] and doy
[use current lane number ^ 2], but using these elsewhere results
in always getting 0 as the result...
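One way to read the description above, as a C model of a single quad: the arrays hold the per-lane SRC1 and SRC2 values, and every name here is invented for the sketch. Only the fixed l0-l3 source lanes are modelled, not dox/doy.

```c
enum quad_op { Q_ADD, Q_SUB, Q_SUBR, Q_MOV2 };

/* Computes DST for one lane of a quad.
 * ops[i] is the operation selected for lane i [op1..op4],
 * srclane is the lane SRC1 is taken from for all four lanes. */
static float quadop_lane(const enum quad_op ops[4], unsigned srclane,
                         unsigned lane,
                         const float src1[4], const float src2[4])
{
    float a = src1[srclane & 3];    /* SRC1 from the chosen lane */
    float b = src2[lane & 3];       /* SRC2 from the current lane */
    switch (ops[lane & 3]) {
    case Q_ADD:  return a + b;
    case Q_SUB:  return a - b;
    case Q_SUBR: return b - a;
    case Q_MOV2: return b;
    }
    return 0.0f;    /* not reached for valid ops */
}
```

With src1 = {1, 2, 3, 4}, src2 = {10, 20, 30, 40}, srclane l0 and ops add, sub, subr, mov2, the four lanes get 11, -19, 29 and 40.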
add f32 DST mul SRC1 SRC2 SRC3
A multiply-add instruction. With intermediate rounding. Nothing
interesting. DST = SRC1 * SRC2 + SRC3;
fma f64 DST SRC1 SRC2 SRC3
Fused multiply-add, with no intermediate rounding.
texauto [deriv] live/all <texargs>
Does a texture fetch. Inputs are: x, y, z, array index, dref [skip all
that your current sampler setup doesn't use]. x, y, z, dref are floats,
array index is integer. If running in FP or the deriv flag is on,
derivatives are computed based on coordinates in all threads of current
quad. Otherwise, derivatives are assumed 0. For FP, if the live flag
is on, the tex instruction is only run for fragments that are going to
be actually written to the render target, ie. for ones that are inside
the rendered primitive and haven't been discarded yet. all executes
the tex even for non-visible fragments, which is needed if they're going
to be used for further derivatives, explicit or implicit.
texbias [deriv] live/all <texargs>
Same as texauto, except takes an additional [last] float input specifying
the LOD bias to add. Note that bias needs to be the same for all threads
in the current quad executing the texbias insn.
texlod live/all <texargs>
Does a texture fetch with given coordinates and LOD. Inputs are like
texbias, except you have explicit LOD instead of the bias. Just like
in texbias, the LOD should be the same for all threads involved.
texsize live/all <texargs>
Gives you (width, height, depth, mipmap level count) in output, takes
integer LOD parameter as its only input.
texfetch live/all <texargs>
A single-texel fetch. The inputs are x, y, z, index, lod, and are all
integer.
emit
GP-only instruction that emits current contents of $o registers as the
next vertex in the output primitive and clears $o for some reason.
restart
GP-only instruction that finishes current output primitive and starts
a new one.
bra <code target>
Branches to the given place in the code. If only some subset of threads
in the current warp executes it, one of the paths is chosen as the active
one, and the other is suspended until the active path exits or rejoins.
call <code target>
Pushes address of the next insn onto the stack and branches to given place.
Cannot be predicated.
ret
Returns from a called function. If there's some not-yet-returned divergent
path on the current stack level, switches to it. Otherwise pops off the
entry from stack, rejoins all the paths to the pre-call state, and
continues execution from the return address on stack. Accepts predicates.
breakaddr <code target>
Like call, except doesn't branch anywhere, uses given operand as the
return address, and pushes a different type of entry onto the stack.
break
Like ret, except accepts breakaddr's stack entry type, not call's.
quadon
Temporarily enables all threads in the current quad, even if they were
disabled before [by diverging, exiting, or not getting started at all].
Nesting this is probably a bad idea, and so is using any non-quadpop
control insns while this is active. For diverged threads, the saved PC
is unaffected by this temporary enabling.
quadpop
Undoes a previous quadon command.
bar sync <barrier number>
Waits until all threads in the block arrive at the barrier, then continues
execution... probably... somehow...
trap
Causes an error, killing the program instantly.
joinat <code target>
The argument is the address of a future join instruction and gets pushed
onto the stack, together with a mask of currently active threads, for
future rejoining.
brkpt
Doesn't seem to do anything, probably generates a breakpoint when enabled
somewhere in PGRAPH, somehow.
exit
Actually, not a separate instruction, just a modifier available on all
long insns. Finishes thread's execution after the current insn ends.
join
Also a modifier. Switches to other diverged execution paths on the same
stack level, until they've all reached the join point, then pops off the
entry and continues execution with a rejoined path.