# Linux (for PlayStation®2) VCL Preprocessor User's Manual

© 2002 Sony Computer Entertainment Inc.

Publication date: July 2002

Sony Computer Entertainment Inc. 1-1, Akasaka 7-chome, Minato-ku Tokyo 107-0052, Japan

Sony Computer Entertainment America 919 E. Hillsdale Blvd. Foster City, CA 94404, U.S.A.

Sony Computer Entertainment Europe 30 Golden Square London W1F 9LD, U.K.

Unauthorized reproduction or distribution of this book, in whole or in part, is expressly prohibited by law.

PlayStation® and the PlayStation® logo ( ) are registered trademarks of Sony Computer Entertainment Inc. All other trademarks are property of their respective owners and/or their licensors.

gasp ©1996 The Free Software Foundation, Inc. (FSF).

# **Table of Contents**

| Overview                                                                                        | 1           |
|-------------------------------------------------------------------------------------------------|-------------|
| What is VCL?  Merging of Upper and Lower Instructions in One Code Stream  Syntax Simplification | 1<br>1<br>1 |
| Variable Naming and Registers Allocation                                                        | 1           |
| Instruction Scheduling<br>Macro Usage                                                           | 1<br>2      |
| Syntax Simplification  Magging of Upper and Lewer Instructions                                  | 3           |
| Merging of Upper and Lower Instructions Variable Naming and Registers Allocation                | 3           |
| Number Literals Peculiarities Register Availability                                             | 3           |
| Instruction Simplification Floating-Point Register Fields Specification                         | 4<br>6      |
| Broadcast Instructions                                                                          | 6           |
| Instruction Scheduling and Data Tracking Instruction Scheduling                                 | 7<br>7      |
| Loop Unrolling                                                                                  | 7           |
| Instructions Ordering<br>Branch Delay Slots                                                     | 8           |
| Code Removal<br>E, D, and T Bits                                                                | 9           |
| Load and Store Offsets                                                                          | 10          |
| Data Tracking Set Before Use                                                                    | 11<br>11    |
| Branching                                                                                       | 12          |
| Labels Calls to Functions                                                                       | 12<br>12    |
| Functions Calling Sub-Functions Recursive Functions                                             | 12          |
| Jump Tables                                                                                     | 13<br>13    |
| Integration of VSM Code Within VCL                                                              | 14          |
| .vsm / .endvsm and .raw / .endraw .rawloop / .endrawloop                                        | 14<br>14    |
| Macros and Other Preprocessor Usages                                                            | 16          |
| Using the C Preprocessor Using Macros with the C Preprocessor                                   | 16<br>16    |
| Using gasp Using Macros with gasp                                                               | 16<br>17    |
| Issues with gasp                                                                                | 17          |
| Examples of Preprocessor Usage  Command-Line Parameters                                         | 17<br>18    |
| Command-Line Syntax                                                                             | 18          |
| -с<br>-С                                                                                        | 18<br>18    |
| -d                                                                                              | 18          |
| -e<br>-f                                                                                        | 18<br>18    |
| <b>−</b> a                                                                                      | 18          |

| –G                                                    |                                                                                  | 18 |
|-------------------------------------------------------|----------------------------------------------------------------------------------|----|
| –h                                                    | l                                                                                | 18 |
| -l<                                                   | <includefilepath></includefilepath>                                              | 18 |
| <ir< td=""><td>nputfilename&gt;</td><td>19</td></ir<> | nputfilename>                                                                    | 19 |
| -K                                                    |                                                                                  | 19 |
| -L                                                    | -                                                                                | 19 |
| -m                                                    | n                                                                                | 19 |
| -N                                                    | Л                                                                                | 19 |
| –n                                                    | 1                                                                                | 19 |
| <b>-</b> o                                            | o <outputfilename></outputfilename>                                              | 19 |
| -P                                                    |                                                                                  | 19 |
| -S                                                    |                                                                                  | 19 |
| -t<                                                   | <seconds></seconds>                                                              | 19 |
|                                                       | <string></string>                                                                | 19 |
| -Z                                                    | <del>-</del>                                                                     | 20 |
| IZ                                                    |                                                                                  |    |
| Keywords                                              |                                                                                  | 21 |
| _                                                     | lobal symbolname                                                                 | 21 |
|                                                       | it_vi VIxx <, VIxx>                                                              | 21 |
|                                                       | it_vf VFxx <, VFxx>                                                              | 21 |
|                                                       | it_vi_all                                                                        | 21 |
|                                                       | it_vf_all                                                                        | 21 |
|                                                       | npg vucodeoffset                                                                 | 21 |
|                                                       | ame progname                                                                     | 21 |
|                                                       | aw / .endraw                                                                     | 21 |
|                                                       | awloop / .endrawloop                                                             | 21 |
|                                                       | yntax old   new                                                                  | 22 |
|                                                       | sm / .endvsm                                                                     | 22 |
|                                                       | barrier                                                                          | 22 |
|                                                       | cont                                                                             | 22 |
|                                                       | enter /endenter                                                                  | 22 |
|                                                       | _vi (VIxx) varname                                                               | 22 |
|                                                       | _vf (VFxx) varname                                                               | 22 |
|                                                       | _hw_acc acc / in_hw_clip clip / in_hw_i i / in_hw_p p / in_hw_q q / in_hw_status | 20 |
|                                                       | atus                                                                             | 22 |
|                                                       | exit /endexit<br>exitm macroname /endexit                                        | 22 |
|                                                       |                                                                                  | 22 |
|                                                       | ut_vi (VIxx) varname<br>ut_vf (VFxx) varname                                     |    |
|                                                       |                                                                                  | 23 |
|                                                       | ut_hw_acc acc / out_hw_clip clip / out_hw_i i / out_hw_p p / out_hw_q q          | 23 |
| L                                                     | LoopCS n,m                                                                       | 23 |
| Appendix A                                            | A: Macro Examples                                                                | 25 |
| Annendiy I                                            | B: Detailed Information Regarding Loops in VCL                                   | 41 |
|                                                       | pelining and VCL                                                                 | 41 |
|                                                       | -l oopCS n m" Directive                                                          | 43 |

# **About This Manual**

This manual provides a description of the various functionalities of the VCL Preprocessor (v1.3), for use with Linux (for PlayStation®2).

# **Changes Since Last Release**

None

# **Typographic Conventions**

Certain Typographic Conventions are used throughout this manual to clarify the meaning of the text:

| Convention  | Meaning                                                                                      |
|-------------|----------------------------------------------------------------------------------------------|
| courier     | Indicates literal program code.                                                              |
| italic      | Indicates names of arguments and structure members (in structure/function definitions only). |
| medium bold | Indicates data types and structure/function names (in structure/function definitions only).  |
| blue        | Indicates a hyperlink.                                                                       |

# What is VCL?

The VCL preprocessor is an application that was developed to simplify some of the complex and tedious tasks associated with assembly-level programming of the VU processor. These tasks include:

- Dual pipeline processing
- Loop unrolling
- Register allocation
- Instruction scheduling

VCL outputs a standard VSM/DSM file (that can be compiled using ee-dvp-as). It is available for both the Linux and Win32 platforms.

# Merging of Upper and Lower Instructions in One Code Stream

VCL simplifies VU programming by merging upper and lower instructions. Pairing of instructions is no longer required.

# **Syntax Simplification**

In some cases, standard VU programming requires the programmer to specify which register will be used as a parameter in instructions such as MULQ. VCL performs the proper instruction assignment, and only requires the programmer to specify the more generic MUL. The rest is deduced from parameters attached to the instruction in question.

# **Variable Naming and Registers Allocation**

VCL provides for variable naming. Self-explanatory variable names like 'vertexptr' and 'vertexcolor' are permissible, instead of using standard register names such as 'vi02' or 'vf06'.

# **Instruction Scheduling**

Assembly language programming for high-performance applications requires intimate knowledge of instruction timing (throughput and latency). Because of this, instruction scheduling is often very time consuming.

Data tracking is a tedious task because error messages regarding suspicious variable-related issues (use before set, for example) are not provided. High-performance code can appear to be confusing, especially if two blocks of code not otherwise related are reorganized into one block, for speed purposes.

While knowledge of instructions timing is still highly recommended, use of VCL will simplify instruction scheduling, keeping the code in a logical order, without compromising performance. VCL is aware of timing and dependencies, and in most cases will generate code that rivals hand-tuned code.

One of the most powerful features of VCL is related to loop unrolling. (See "Loop Unrolling" for more details.)

# Macro Usage

With C-preprocessor or GASP, macro usage has always been possible, even without VCL. Use is limited however, as instructions must be paired, and that this pairing may not be broken down. However, because VCL programming is single-streamed, it reinforces the power offered by macros.

# **Syntax Simplification**

VCL offers two syntax schemes, simply referred to as "old" and "new". The "old" scheme is standard VU programming, where instructions must be specified using the full name, and where fields must be specified in instructions, as well as the registers to which they belong.

The "new" one attempts to simplify coding and code readability. To enable its use, the following must be added to the source file before any instructions:

```
.syntax new
```

An alternative way of enabling it is to specify "-n" as a command-line argument to VCL.

Note that merging of upper and lower instructions, as well as register naming, happens with either syntax. This section describes aspects of the "new" syntax.

### Merging of Upper and Lower Instructions

Because VCL manages instruction scheduling, it is no longer necessary to pair instructions. All instructions (from either the upper or lower pipeline) can now be ordered sequentially, in a single stream.

### **Variable Naming and Registers Allocation**

The use of named variables is permissible and encouraged. Besides making code more readable, named variables allow VCL to manage register allocation.

It is allowable to tie a specific register to a named variable and use register names directly. (See "Instruction Scheduling" for more details.) However, this limits the register allocation process.

The following are acceptable examples of variable naming usage:

```
IADDIU
             inputptr, vi00, 32
LQ
             vertex, 0(inputptr)
MAX
             vector1111, vf00, vf00[w]
```

VCL checks to see if a variable is used before being set, and will output an error message if that is the case.

#### Floating-Point and Integer Variable Naming

VCL tracks floating-point and integer variable names separately. Although not recommended, it is still possible to use the same variable name for both the floating point and the integer. Note that each will have to be initialized separately prior to use.

#### Special Variable Name "i"

To ease the porting of C code to VCL, the variable name "i" is permitted for an integer register. Assuming such a named variable isn't used with upper instructions, VCL will be able to discern between the hardware register I, and the variable. However, use of this variable name is discouraged, as it inhibits one of VCL's primary functions, which is to make code more readable.

#### **Number Literals Peculiarities**

In cases where a number literal must be specified, it is possible to specify instead a string, which will be assumed to be a defined value, and will therefore be ported as-is to the output file. Care must be taken, as VCL might not always be able to differentiate a typo from valid code. As an example, if you want to write:

#### 4 Syntax Simplification

LOI 0x3F

but write instead:

LOI x3F

VCL will accept it, only to have EE-DVP-AS reject it, with what could look like a cryptic message.

# **Register Availability**

It is necessary to let VCL know which registers are available for it to use. This is accomplished by using the following keywords:

".init\_vi" and ".init\_vf" are used to specify, respectively, which integer and floating-point registers are available. Specifying "vi00" or "vf00" is illegal, and will result in an error message. ".init\_vi\_all" and ".init\_vf\_all" may be alternatively used in lieu of specifying every single register by hand. Having both ".init\_vi" and ".init\_vi all", for example, is illegal.

### **Instruction Simplification**

Many VU instructions stem from a single, more generic instruction. Such is the case for ADD, ADDi, ADDq, ADDbc, ADDAi, ADDAbc.. In this case, the stem instruction would be ADD.

VCL accepts the replacement of the specific instructions by the stem instruction, as the specific instruction may be deduced by the parameters attached to the instruction. Therefore, a program could use:

And VCL would convert this (possibly) to:

ADDq.xyzw vf03xyzw, vf03xyzw, q

The following table lists all instructions affected by the syntax simplification:

Table 1

| Original Instruction | Simplified stem instruction |
|----------------------|-----------------------------|
| ADD                  | ADD                         |
| ADDi                 |                             |
| ADDq                 |                             |
| ADDbc                |                             |
| ADDA                 |                             |
| ADDAi                |                             |
| ADDAq                |                             |
| ADDAbc               |                             |

|                      | -                           |
|----------------------|-----------------------------|
| Original Instruction | Simplified stem instruction |
| SUB                  | SUB                         |
| SUBi                 |                             |
| SUBq                 |                             |
| SUBbc                |                             |
| SUBA                 |                             |
| SUBAi                |                             |
| SUBAq                |                             |
| SUBAbc               |                             |
| MUL                  | MUL                         |
| MULi                 |                             |
| MULq                 |                             |
| MULbc                |                             |
| MULA                 |                             |
| MULAi                |                             |
| MULAq                |                             |
| MULAbc               |                             |
| MADD                 | MADD                        |
| MADDi                |                             |
| MADDq                |                             |
| MADDbc               |                             |
| MADDA                |                             |
| MADDAi               |                             |
| MADDAq               |                             |
| MADDAbc              |                             |
| MSUB                 | MSUB                        |
| MSUBi                |                             |
| MSUBq                |                             |
| MSUBbc               |                             |
| MSUBA                |                             |
| MSUBAi               |                             |
| MSUBAq               |                             |
| MSUBAbc              |                             |
| MAX                  | MAX                         |
| MAXi                 |                             |
| MAXbc                |                             |
| MINI                 | MINI                        |
| MINIi                |                             |
| MINIbc               |                             |
|                      |                             |

## **Floating-Point Register Fields Specification**

VCL only requires that a floating-point register field be specified next to the instruction itself, as opposed to specifying it on the instruction and the register that it belongs to. Therefore, the following would be used:

```
ADD.xyz
                      newvertexposition, vertexposition, translation
Instead of:
         ADD.xyz
                  vf04xyz, vf03xyz, vf02xyz
```

Specifying none is understood to be the same as specifying all (xyzw). Therefore, the following 2 cases would be equivalent:

```
MAX color, color, vector0000 MAX.xyzw color, color, vector0000
```

#### **Broadcast Instructions**

When a field used as a broadcast is specified in the instruction (such as ADDbc or MADDbc), VCL requires the specified field to be next to the associated register. The following example is a typical vertex multiplication by a matrix.

```
acc, matrix0, inputvertex[x]
acc, matrix1, inputvertex[y]
MUL
MADD acc, matrix1, inputvertex[y] MADD acc, matrix2, inputvertex[z]
MADD finalvertex, matrix3, inputvertex[w]
```

# Instruction Scheduling and Data Tracking

### Instruction Scheduling

Another useful feature of VCL is instruction rescheduling, which maximizes execution speed and code compactness. VCL tracks timing for each instruction (throughput and latency), and will try to reorder instructions to minimize stalls and maximize efficiency. In some rare cases (where possible), it will also try to move instructions from the upper to lower pipeline, and vice-versa.

VCL also tracks the I, ACC, Q and P registers, as well as the CLIP flags, and will generate proper delays between related instructions to insure valid code generation.

If, for some reason, you want to avoid instruction rescheduling past a certain point, simply insert the keyword:

```
--barrier
```

Important: Under certain circumstances, using the following line of code in a VSM/DSM file causes a problem:

```
NOP[E]
             XGKICK
                           VTxx
```

To avoid the problem, placing these two instructions on two different lines:

```
NOP
              XGKICK
                            VIxx
NOP[E]
              NOP
```

To ensure such code isn't generated, follow an XGKICK instruction with "--barrier".

# Loop Unrolling

Instead of repeating the same code many times in a row, VCL allows loop unrolling by specifying the following, at the beginning of a loop:

```
--LoopCS
              n,m
```

The loop ends when a branch instruction to the beginning of the loop is encountered. The following would constitute a valid loop:

```
LoopStart:
--LoopCS 3,3
             inputvertex, 0(inputptr++)
LQI
                        matrix0, inputvertex[x]
MUL
             acc,
                        matrix1, inputvertex[y]
MADD
             acc,
                        matrix2, inputvertex[z]
MADD
             acc,
             finalvertex, matrix3, inputvertex[w]
MADD
SQI
             finalvertex, 0(outputptr++)
IBNE
             inputptr, endbuffer, LoopStart
```

At this time, loop unrolling is limited to simple loops without conditional branching. While VCL will not fail or give any error message if a branch is encountered inside the loop, the code generated will be less than optimal.

If the loop in question is performing clipping operations, it is possible to set the ADC bit without actually doing any conditional branch:

```
CLIPW.xyz clipvtx, ClipData ; Trigger clip calculations
FCAND vi01, 0x3FFFF ; Set if any of previous 3 vtx is clipped
IADDIU adc_bit, vi01, 0x7FFF
ISW.w adc bit, outyl xyzf2(output buffer) ; Set if clipped
ISW.w
                       adc bit, outvl xyzf2(output buffer) ; Set if clipped
```

Loop unrolling almost always results in a proloque followed by the main loop body, then concluded by an epilogue. The size of the prologue and epilogue depends partly on the parameters given to "--LoopCS", but also depends on how well the code can be rescheduled.

The "--LoopCs" keyword takes two parameters: n (minimum number of loops) and m (slop count).

#### n (Minimum Number of Loops)

As it unrolls the loop, VCL needs to know the loop's minimum iteration count. This will allow it to potentially move instructions with side effects (stores, for example) higher in the execution pipeline, sometimes even before any conditional branch is executed. This translates into tighter (hence faster) code.

#### m (Slop Count)

The slop count describes how many output iterations can be done without overwriting data past the end of an output array. A value of 1 would indicate that it is safe to execute an output instruction one iteration ahead of the current iteration, allowing for better instructions scheduling in some cases.

If the input is 30 vertices and "--LoopCS 3, 3" is specified, a 33-vertex output buffer will be required. If the input is less than 30 vertices, specify "--LoopCS 3,0" instead. Note that the VCL may not process vertices ahead. In this case it will not matter if m is equal to 3, or even 500. It will behave the same as if m is 0.

**Tip:** The number of loop iterations is often based on a counter, which is decremented once per iteration. The loop is repeated until that counter reaches 0. A vertex counter is a good example of this. If the loop happens to be running through a given array, it is worth nothing that instead of using a counter, the address of the buffer ending may be calculated ahead of the loop. Then, instead of comparing the counter to 0, compare the current array pointer to the end pointer for equality. The end result is a saved instruction (the decrementing of the counter), tighter loop, and therefore faster code.

(Refer to Appendix B for more information.)

### **Instructions Ordering**

Care must be taken to place instructions in the order in which they are to take place. For example, standard VU programming would allow a DIV instruction right above an MULQ. The result of the previous DIV would be used, as is it known that the result of a DIV instruction isn't available for seven cycles. With VCL, the MULQ must be placed before DIV. VCL will reposition them as it sees fit, to maximize execution speed and code compactness.

#### Memory Aliasing and Instructions Reordering

VCL reorders instructions while preserving logical order. Instructions modifying a given variable will always appear in the same relative order to each other as they are in the input file. However, VCL has no explicit knowledge of memory, and more specifically, of memory aliasing.

Memory aliasing occurs when there are two different pointers potentially pointing to the same area in memory. In such a case, it is important that instructions reading and writing to both pointers' memory be kept in the same relative order.

Steps must be taken to let VCL know that two pointers are potentially referring to the same memory. By simply appending a suffix to all instructions related to the aliased pointers, it will know to preserve the relative order. The following is an example for "ptr1" and "ptr2":

```
var1, 0(ptr1):memgroup1
LQ
      var2, 3(ptr2):memgroup1
```

Here, "memgroup1" could be any valid string. It ensures that the store (SQ) always takes precedence to the load (LQ).

Within one program, more than one group may be used.

#### **Peculiarities with XGKick**

Moving XGKick instructions could potentially result in hazardous code. For example, the XGKick could be moved ahead of stores to the buffer to be XGKicked, To avoid any such cases, XGKick will never be moved before or after a store instruction.

### **Branch Delay Slots**

VCL handles branch delay slots independently. Placing any instruction immediately after a branch will cause the instruction in question to be executed only if the branch is not taken (in the case of a non-returning branch) or when the program counter comes back from a sub-routine (in the case of a function call such as BAL).

#### Code Removal

VCL will recognize if a code block isn't reachable and will remove it. This saves VU micro memory in the process.

Empty lines (NOP NOP) are also removed, and will be replaced if necessary by either a WAITQ or WAITP instruction. Any stall introduction by instruction removal is noted by VCL as a comment in the output file.

#### Floating-Point Field Pruning

VCL keeps track of which fields of a floating-point register are used, and will prune any field that does not need to be used. This step allows VCL to more effectively schedule instructions under some circumstances. The following is an example of pruning:

```
LQ color, 2(inputptr)
ADD color, color, ambientcolor
MINI color, color, vector1111
SQ.xyz color, 2(outputptr)
```

Assuming that "color" isn't used later in the code, VCL will prune the w field in the three first instructions. The code effectively becomes the same as:

```
LQ.xyz color, 2(inputptr)
ADD.xyz color, color, ambientcolor
MINI.xyz color, color, vector1111
SQ.xyz color, 2(outputptr)
```

## E, D, and T Bits

Putting an [E], [D] and/or [T] bit on an instruction is valid with VCL. However, as it is rescheduled, the bit will move around with the instruction it is attached to. Also, if the instruction it is attached to is duplicated (in such cases as loop unrolling), the bit will be duplicated as well.

Instead of using the [E] bit, it is recommended to use the VCL keyword:

```
--cont
```

"--cont", which stands for "continue", lets VCL know the program will restart from this point. Effectively, VCL inserts a "NOP[E] NOP" line. The main advantage of using this keyword over inserting an explicit [E] bit is that it removes any danger of VCL moving the [E] bit somewhere unexpected. Another solution to this problem is to attach the bit to a label such as:

```
Label: [D]
```

Also note, as with "--barrier", instructions are not rescheduled beyond a "--cont".

#### **Load and Store Offsets**

When a loop containing load and store instructions LQ and SQ is unrolled, VCL will most likely alter the offset parameter to at least some of the instructions. This is a result of moving the instruction past the increment of the pointer the offset relates to. The following is an example:

```
vertex, 0(inputptr)
stq, 1(inputptr)
LQ
LQ
           rgb, 2(inputptr)
IADDIU inputptr, inputptr, 3
```

A possible (and simplistic) unrolling of such an instruction sequence would be:

```
LQ vertex1, 0(inputptr)
LQ stq1, 1(inputptr)
LQ rgb1, 2(inputptr)
IADDIU inputptr, inputptr, 3
 . . .
LQ vertex2, 0(inputptr)
LQ stq2, 1(inputptr)
LQ rgb2, 2(inputptr)
IADDIU inputptr, inputptr, 3
LQ vertex3, 0(inputptr)
LQ stq3, 1(inputptr)
LQ rgb3, 2(inputptr)
IADDIU inputptr, inputptr, 3
```

However, VCL might instead unroll a sequence as follows, which produces the same result, but under certain conditions provides better instruction rescheduling:

```
LQ vertex1, 0(inputptr)
LQ stq1, 1(inputptr)
IADDIU inputptr, inputptr, 3
LQ rgb1, -1(inputptr)
 . . .
LQ vertex2, 0(inputptr)
LQ rgb2, 2(inputptr)
IADDIU inputptr, inputptr, 3
LQ stq2, -2(inputptr)
LQ vertex3, 0(inputptr)
LQ stq3, 1(inputptr)
LQ rgb3, 2(inputptr)
IADDIU inputptr, inputptr, 3
```

Note that using LQI, LQD, SQI, and SQD will prevent the possibility of such optimizations. However, their use has proven to be more effective under certain conditions such as very tight loops.

#### **Data Tracking**

VCL tracks data usage for many reasons. If VCL determines that a variable is set but not used afterwards, it will remove the setting instructions. If these settings need to remain in the code for the purpose of eventually passing the variable as a parameter out of the code block, use the following keyword:

```
out vi (VIxx) intvarname
out vf (VFxx) floatvarname
```

"VIxx" and "VFxx" are an integer and a float register, respectively. The names "intvarname" and "floatvarname" correspond to these registers. VCL will make sure the specified register will contain the named variable value. These keywords must appear between the two keywords "--exit" and "-endexit", or between "--exitm" and "--endexit". (See "-exit / -endexit" for more details.)

Note that similar instructions exist for ACC (out hw acc acc), I (out hw i i), P (out hw p p), Q (out hw q q) registers, and the CLIP flags (out hw clip clip).

#### **Set Before Use**

If a variable is found to be used before it is even initialized, VCL will output an error message. If it is necessary to pass in a variable as a parameter from a different block, or from standard VSM/DSM code, use the following keywords:

```
in vi (VIxx) intvarname
in vf (VFxx) floatvarname
```

These keywords must appear between the two keywords "--enter" and "--endenter". (See "-enter/ -endenter" for more details.)

Specifying "in vi" and "in vf" does not automatically mean the link between the register and the variable name will remain for the rest of the program, even if the same link is specified using "out vi" or "out vf". VCL reserves the right to break that link at any point.

Note that similar instructions exist for ACC (in hw acc acc), I (in hw i i), P (in hw p p) and Q (in hw q q) registers, as well as the FPU results flags (in hw status status) and the CLIP flags (in hw clip clip).

An alternative to using "in hw clip clip" is to have the following instruction before any clipping-related instruction:

```
FCSET 0
```

This will let VCL know all clipping flags are initialized.

VCL does not keep track of all flags. It is aware of MAC flags, but not about sticky bits.

To initialize a variable to 0, it is illegal to use the following:

```
varname, varname, varname
```

Instead, use one of the two following methods:

```
varname, vf00, vf00
MFIR varname, vi00; Flags are left untouched!
```

# **Branching**

#### Labels

With VCL, labels are defined in the same way as under VSM/DSM. However, for a label to remain in the output file, it must be positioned between the "--enter"/"--endenter" and "--exit"/"--endexit" keywords (or "--exitm"/"--endexit"). If "--exit"/"--endexit" is omitted, the label can be positioned after "--enter"/"--endenter".

#### Calls to Functions

VCL currently does not support far function calls (calls to external functions). Therefore, any functions called must be included in the same file as the caller, either directly or via a header file (for cpreprocessor) or include file (for gasp).

Within VCL, a function is treated like any branch. There is no special method of passing parameters in and out. The following code is an example:

```
IADDIU vertexptr, vi00, 64
IADDIU endptr, vertexptr, 2
                   endptr, vertexptr, 20
       BAL
                   retaddress, TransformVerticesInPlace
       . . .
TransformVerticesInPlace:
      LQ vertex, 0 (vertexptr)
       ; *** Matrix used without being initialized! ***
      MatrixMultiplyVertex vertex, matrix, vertex
                    vertex, 0(vertexptr++)
       IBNE
                   vertexptr, endptr
       JR
                    retaddress
```

In this case (for register scope), the function "TransformVerticesInPlace" behaves the same as if it were in-lined. Assuming the matrix hasn't been initialized before the function call, an error would be given to this effect.

# **Functions Calling Sub-Functions**

It is valid for a function to call a sub-function. However, be aware that returning addresses conflicts will result in an infinite loop.

#### **Recursive Functions**

VCL does not support recursive functions natively. However, it is possible for a program to maintain its own stack, push the return address register before calling the function recursively, pop the address on return, and eventually return, as is shown in the example below:

```
RecursiveFunction:
ISUBIU counter, counter, 1
            counter, vi00, RecFcnEnd
IBEQ
ISUBIU stackptr, stackptr, 1
ISWR.x retaddress, (stackptr):VCLSML_STACK
BAL retaddress, RecursiveFunction):VCLSML_STACK
ILWR.x retaddress, (stackptr):VCLSML_STACK
IADDIU stackptr, stackptr, 1
RecFcnEnd:
   retaddress
JR
```

# **Jump Tables**

VCL currently does not support jump tables. Such feature must therefore be partly implemented in standard VSM/DSM code. However, each function of the jump table may be coded separately using VCL.

# Integration of VSM Code Within VCL

Sometimes it may be necessary to incorporate traditional VSM/DSM code within VCL code. This section introduces two methods to do so, and explains how they are used.

#### .vsm / .endvsm and .raw / .endraw

This is used for inserting pre-formatted, pre-ordered VSM code. Instructions therefore have to be organized in the original 2 stream (upper and lower). However, variables may still be named, and use the same ones as non-VSM code. Checks for data use before set will still be performed.

```
".raw" / ".endraw" is simply an alternative spelling for ".vsm" / ".endvsm".
```

Note that code within a VSM block isn't rescheduled. The block acts the same way as "--barrier", in that instruction rescheduling is not performed across the block. You must consider whether or not the use of such a block contributes to better overall performance.

Also note that because such block is meant for code only, it must appear between "--enter" / "-endenter" and "--exit" / "--endexit" (or "--exitm" / "--endexit". It must not be thought of as a black box that is ported as-is to the output file. For the same reason, specifying keywords like ".equ" inside such a block is invalid.

The following is an example of a .vsm / .endvsm block:

```
IADDUI matrixptr, vi00, 0
MatrixLoad mat1, 0, (matrixptr)
MatrixLoad mat2, 4, (matrixptr)
; Swap matrices
max mat1[1], mat2[1], mat2[1] move mat2[1], mat1[1] max mat1[2], mat2[2], mat2[2] move mat2[3], mat1[3] move mat2[3], mat1[3]
.endvsm
MatrixSave mat1, 0, (matrixptr)
MatrixSave mat2, 4, (matrixptr)
```

#### .rawloop / .endrawloop

These directives are used to enclose a pre-formatted VSM code block to be unrolled. More control is permitted as far as prologue and epilogue creation goes. There following is an example:

```
.rawloop
loop:
          --LoopCS 10,10
         5..ftoi4.xyz ixyz, sxyz 1..lqi vrt, (ptr++)
4..mul.xyz sxyz, nxfrm, q nop
             nop
                                                                            nop
         1.mul acc, m[3], vf00[w] 2.move nxfrm, xfrm
1.madd acc, m[0], vrt[x] 2.div q, vf00[w], xfrm[w]
1.madd acc, m[1], vrt[y] 1.ibne ptr, end_ptr, loop
1.madd xfrm, m[2], vrt[z] 5.sqi ixyz, (optr++)
          .endrawloop
```

The "--LoopCS" keyword is used the same way as described in "Loop Unrolling". While building the prologue, the instructions with the lowest numbers are introduced first, and follow this order: 1 1-2 1-2-3 1-2-3-4 ... etc. Then the loop itself and the epilogue are created in accordance with the prologue.

(Refer to Appendix B for more information.)

# Macros and Other Preprocessor Usages

Using macros with VU code has always been possible, using tools such as the C preprocessor and gasp. The use of macros is also possible with VCL. In fact, VCL offers even more opportunities to use these tools, as the input is single-streamed, as opposed to the double-streamed VSM/DSM programming style.

### Using the C Preprocessor

VCL can automatically pipe the code through the C preprocessor by giving it the "-G" (uppercase G) command-line parameter.

Using it will allow the use of all preprocessor directives:

```
#if, #ifdef, #ifndef
#elif, #else
#endif
#include
#define
#undef
#line
#error
```

# Using Macros with the C Preprocessor

Use of macros with the C preprocessor is identical to that of C/C++. However, using macros with a label inside proves to be an interesting task. The only way around doing this is to include an extra parameter to the macro, which will then be used internally to generate a unique label, using the ## token concatenation directive:

```
#define mymacro(param1, param2, uniqueid)
   IADDIU Counter, vi00, 5
mymacrolabel ## uniqueid:
  ISUBIU Counter, Counter, 1
            Counter, vi00, mymacrolabel ## uniqueid \
}
```

Each call to "mymacro" within a given file (including all the included header files) must pass a unique identifier for "uniqueid".

### Using gasp

Just as with the C preprocessor, VCL can automatically pipe the code through gasp, by giving it the "-q" (lowercase g) command-line parameter.

Using gasp will allow the use, among others, of the following directives:

```
.assign
.include
.macro
.endm
.end
```

Refer to the gasp documentation for more information on the subject. Online documentation may be found at the following URLs:

http://www.objsw.com/docs/gasp\_toc.html

http://sunsite.utk.edu/gnu/binutils/gasp\_toc.html

http://case.ispras.ru/PublicScripts/cgi-bin/lib.cgi/gnu/gasp\_toc.html

# **Using Macros with gasp**

As shown in the following example, using macros with gasp is simple:

```
.macro mymacro param1,param2
IADDIU Counter, vi00, 5
mymacrolabel\@:
ISUBIU Counter, Counter, 1
IBNE Counter, vi00, mymacrolabel\@
.endm
```

gasp will replace the "\@" by a number, which is incremented with each instance of a macro, so the same macro may be used many times in the same source file.

### Issues with gasp

With some versions of gasp, including gasp, number literals aren't translated properly to the output file. Examples of failed conversions are:

```
0x123 (converted to) 0
0.000123 (converted to) 0.1
```

The way to fix this, for now, is to use the following syntax:

```
0x123 (switch to) H'123
0.000123 (switch to) 1.23E-4 or (0.0001)
```

Also, any line without a space or tab in front of the first non-white character will be converted as though the first word was a label. The newly created label will be suffixed by a semi-column.

# **Examples of Preprocessor Usage**

Refer to Appendix A for examples.

## **Command-Line Parameters**

VCL is a command-line-based preprocessor. Various parameters may be passed to it, all of which are described in this chapter.

# **Command-Line Syntax**

VCL must be called with parameters, following the syntax:

```
[-cCdefgGhKLmMnPSZ] [-I<includefilepath>] [-t<seconds>]
[-o<outputfilename] [-u<string>] <inputfilename>
```

-c

Emit nearly original source code as comments.

-C

Disable the code reduction pass.

-d

Dumb code is generated. For example, rescheduling of instructions isn't performed.

-е

Disable the generation of [E] bits at the end of the code. Alternatively, the use of -exitm without an argument may be used.

-f

Disable the generation of alignment directives (.align n).

**-g** 

Run gasp on the input before any VCL-specific task is done. gasp is called with the following parameter string: "-p -s -c ';'". "-l" is also passed if specified.

-G

Run the C preprocessor on the input before any VCL-specific task is done.

-h

Print out the command-line help.

# -l<includefilepath>

To be used with "-g"; tells gasp where to find include files.

### <inputfilename>

Specify the name of the VCL source file. If it is not specified, VCL will read from the standard input.

#### -K

The temporary files created by the pre-processors are not deleted. The file locations being OS-dependant, refer to the VCL output to find them.

#### -L

Globally disable loop code generation.

#### -m

Generate ".mpg" and DMA tags automatically. This may be used as an alternative to ".mpg" within the VCL source file.

#### -M

VCL retains the relative order of load and store instructions (known as the timid memory access mode). Note: -M will be deprecated in future versions.

#### -n

Enable the new syntax. This may be used as an alternative to ".syntax new" within the VCL source file. (See "Syntax Simplification" for more details.)

# -o<outputfilename>

Output file name. Specifying the same name as the source is invalid. If not specified, the result is outputted to the standard output.

# -P

Disable the removal of unused instructions and pruned fields pass. (See "Code Removal" for more details.) **Note:** -P will be deprecated in future versions.

#### -S

The content of loops starting with "--LoopCS" will be reorganized to stagger the read and write instructions, and to facilitate memory access by the VIF and GIF.

#### -t<seconds>

Specify the optimizer timeout. <second> must be 1 or higher. Default is 4. Note that this has the potential side-effect of generating different code on different computers, as the processor speed is not taken into account.

## -u<string>

<string> is used as a unique string for label generation, instead of the file name. Useful if the filename is especially long.

# -Z

Disable the immediate field fix up pass.

**Note:** -Z will be deprecated in future versions.

# **Keywords**

### .global symbolname

The directive ".global", along with "symbolname", is ported as-is to the output file. "symbolname" is therefore only assumed to exist and be valid.

### .init\_vi Vlxx <, Vlxx ...>

Inform VCL that the specified integer registers are available for use. Specifying VI00 is illegal, as it is always considered available. (Specifying ".init\_vi" and ".init\_vi\_all" in the same file is illegal. Note that VCL might fail to process code if not given enough registers.

# .init\_vf VFxx <, VFxx ...>

Inform VCL that the specified float registers are available for use. Specifying VF00 is illegal, as it is always considered available (Compare to 8.4.). Specifying ".init\_vf" and ".init\_vf\_all" in the same file is illegal. Note that VCL might fail to process code if not given enough registers.

### .init\_vi\_all

Inform VCL that all integer registers are available for use. Specifying ".init vi" and ".init vi all" in the same file is illegal.

#### init vf all

Inform VCL that all float registers are available for use. Specifying ".init vf" and ".init vf all" in the same file is illegal.

#### .mpg vucodeoffset

Add "ret" DMA tags around the code generated by VCL, for better integration of VCL code with original VSM/DSM code. If ".name" is also specified, two labels will be added, following the syntax "(progname) DmaTag" and "(progname) DmaEnd". "vucodeoffset" is assumed to be a valid address.

# .name progname

Add two labels, one before the code generated by VCL and the other after, following the syntax "(progname)\_CodeStart" and "(progname)\_CodeEnd". For better integration of VCL code with original VSM/DSM code. The labels created are also made available globally, via the directive ".global".

#### .raw / .endraw

Enclose pre-formatted, original VSM-style code. Same as "vsm /.endvsm". (See ".vsm / .endvsm and .raw / .endraw" for more details.)

### .rawloop / .endrawloop

Enclose a pre-formatted, original VSM-style code loop, to be unrolled. (See ".rawloop / .endrawloop" for more details.)

# .syntax old | new

If "old" is specified, the syntax is the same as original VSM/DSM code. "new" specifies the new and simplified syntax. (See "Syntax Simplification" for more details.)

#### .vsm / .endvsm

Enclose pre-formatted, original VSM-style code. Same as ".raw" / ".endraw". (See ".vsm / .endvsm and .raw / .endraw" for more details.)

#### --barrier

Prevent the rescheduling of instruction to go across this line. (See "Instruction Scheduling" for more details.)

#### --cont

Mark a point where a program temporarily stops, and may be restarted from, via a MSCNT. A [E] flag is inserted at this point. (See "E, D and T Bits" for more details.)

### --enter / --endenter

Specify an entry point to VCL code. Any file must have at least one entry point, but may have more than one.

# in\_vi (VIxx) varname

Must be specified between "--enter" and "--endenter". Bind a specific integer to a specific variable name at entry. The register is considered pre-initialized, presumably by standard VSM/DSM code. Such binding is not guaranteed to persist for the duration of the code block. (See "Set Before Use" for more details.)

#### in vf (VFxx) varname

Must be specified between "--enter" and "--endenter". Bind a specific float to a specific variable name at entry. The register is considered pre-initialized, presumably by standard VSM/DSM code. Such binding is not guaranteed to persist for the duration of the code block. (See "Set Before Use" for more details.)

# in\_hw\_acc acc / in\_hw\_clip clip / in\_hw\_i i / in\_hw\_p p / in\_hw\_q q / in\_hw\_status status

Must be specified between "--enter" and "--endenter". The specified register is considered pre-initialized, presumably by standard VSM/DSM code. (See "Set Before Use" for more details.)

#### --exit / --endexit

Specify an exit point to VCL code. Its use is mandatory only if outputting parameters is necessary.

#### --exitm macroname / --endexit

Specify an exit point to VCL code. Its use is mandatory only if outputting parameters is necessary. Unlike "--exit", "--exitm" lets you specify a macro name, which will be sent as-is to the output file, therefore permitting custom ending code. (See "bla bla bla" for more details).

### out vi (VIxx) varname

Must be specified between "--exit" and "--endexit" or between "--exitm" and "--endexit". Bind a specific integer to a specific variable name at exit. The variable may then be passed to other VCL blocks, or to original VSM/DSM code. Such binding is not guaranteed to persist for the duration of the code block. (See "Set Before Use" for more details.)

### out\_vf (VFxx) varname

Must be specified between "--exit" and "--endexit" or between "--exitm" and "--endexit". Bind a specific float to a specific variable name at exit. The variable may then be passed to other VCL blocks, or to original VSM/DSM code. Such binding is not guaranteed to persist for the duration of the code block. (See "Set Before Use" for more details.)

# out hw acc acc / out hw clip clip / out hw i i / out hw pp / out hw qq

Must be specified between "--exit" and "--endexit", or between "--exitm" and "--endexit". The specified register may then be passed to other VCL blocks, or to original VSM/DSM code. (See "Data Tracking" for more details.)

### --LoopCS n,m

Mark a portion of code as being a loop, and instruct VCL to unroll it. "n" is the minimum iteration of the loop, and "m" (slop count) is the amount of output iterations that can be done without overwriting data past the end of an output array. (See "Loop Unrolling" for more details.)

# **Appendix A: Macro Examples**

All macros in this appendix may be found in the file VCL\_SML.i, included with the VCL distribution. All can be used as-is with gasp (via the "-g" command-line parameter), but all can easily be converted to be used by the C preprocessor.

```
;//-----
;// MatrixLoad - Load "matrix" from VU mem location "vumemlocation" +
;//----
  .macro MatrixLoad matrix, offset, vumemlocation
             \matrix[0], \offset+0(\vumemlocation)
             \matrix[1], \offset+1(\vumemlocation)
             \matrix[2], \offset+2(\vumemlocation)
  lq
             \matrix[3], \offset+3(\vumemlocation)
  lq
  .endm
://-----
;// MatrixSave - Save "matrix" to VU mem location "vumemlocation" +
  .macro MatrixSave matrix, offset, vumemlocation
             \matrix[0], \offset+0(\vumemlocation)
             \matrix[1], \offset+1(\vumemlocation)
             \matrix[2], \offset+2(\vumemlocation)
  sq
             \matrix[3], \offset+3(\vumemlocation)
  sa
  .endm
;//-----
;// MatrixIdentity - Set "matrix" to be an identity matrix
;// Thanks to Colin Hugues (SCEE) for that one
;//-----
  .macro MatrixIdentity matrix
             \matrix[0], vf00, vf00[w]
  add.x
  mfir.yzw
             \matrix[0], vi00
  mfir.xzw
             \matrix[1], vi00
             \matrix[1], vf00, vf00[w]
  add.y
  mr32
             \matrix[2], vf00
             \matrix[3], vf00, vf00
  max
  .endm
;// MatrixCopy - Copy "matrixsrc" to "matrixdest"
;// Thanks to Colin Hugues (SCEE) for that one
;//-----
  .macro MatrixCopy matrixdest, matrixsrc
             \matrixdest[0], \matrixsrc[0], \matrixsrc[0]
  max
             \matrixdest[1], \matrixsrc[1]
  move
             \matrixdest[2], \matrixsrc[2], \matrixsrc[2]
  max
             \matrixdest[3], \matrixsrc[3]
  move
  .endm
```

```
;//-----
;// MatrixSwap - Swap the content of "matrix1" and "matrix2"
;// The implementation seems lame, but VCL will convert moves to maxes
;// if it sees fit
;//----
   .macro MatrixSwap matrix1, matrix2
                vclsmlftemp, \matrix1[0]
  move
                 \matrix1[0], \matrix2[0]
  move
                 \matrix2[0], vclsmlftemp
  move
                 vclsmlftemp, \matrix1[1]
  move
                 \matrix1[1], \matrix2[1]
  move
                 \matrix2[1], vclsmlftemp
  move
  move
                 vclsmlftemp, \matrix1[2]
                 \matrix1[2], \matrix2[2]
  move
                 \matrix2[2], vclsmlftemp
  move
                 vclsmlftemp, \matrix1[3]
  move
                 \matrix1[3], \matrix2[3]
  move
                 \matrix2[3], vclsmlftemp
  move
  .endm
;//-----
;// MatrixTranspose - Transpose "matrixsrc" to "matresult". It is safe
;// for "matrixsrc" and "matresult" to be the same.
;// Thanks to Colin Hugues (SCEE) for that one
;//-----
   .macro MatrixTranspose matresult, matrixsrc
  mr32.y vclsmlftemp, \matrixsrc[1]
                 \matresult[1], vf00, \matrixsrc[2][y]
  add.z
  move.y
                 \matresult[2], vclsmlftemp
  mr32.y
                 vclsmlftemp, \matrixsrc[0]
  add.z
                 \matresult[0], vf00, \matrixsrc[2][x]
  mr32.z
                 vclsmlftemp, \matrixsrc[1]
  mul.w
                \matresult[1], vf00, \matrixsrc[3][y]
               vclsmlftemp, \matrixsrc[0]
  mr32.x
 add.y
move.x
mul.w
vclsmlftemp, vf00, \matrixs
mr32.z
move.w
\text{matresult[3], \matrixsrc[2]}
move.w
\text{matresult[2], vclsmlftemp}
vclsmlftemp, \matrixsrc[3]
\text{matrix}
\text{matrix}
\text{matrix}
\text{matrix}
\text{matrix}
\text{matrix}
\text{matrix}
\text{matrix}
                \matresult[0], vf00, \matrixsrc[1][x]
              \matresult[0], v100, \matrixs1c[1][x]
\matresult[1], vclsmlftemp
vclsmlftemp, vf00, \matrixsrc[3][z]
  add.x
move.w
                 \matresult[3], vf00, \matrixsrc[0][w]
                 \matresult[0], vclsmlftemp
  mr32.y
                 \matresult[3], vclsmlftemp
  add.x
                 \matresult[2], vf00, vclsmlftemp[y]
  move.x
                 \matresult[0], \matrixsrc[0]
                                                ;// These 4
instructions will be
                \matresult[1], \matrixsrc[1]
                                                   ;// removed if
  move.y
"matrixsrc" and
                                                   ;// "matresult" are
  move.z
                \matresult[2], \matrixsrc[2]
the same
                 \matresult[3], \matrixsrc[3] ;//
  move.w
  .endm
```

```
;//-----
;// MatrixMultiply - Multiply 2 matrices, "matleft" and "matright", and
;// output the result in "matresult". Dont forget matrix multipli-
;// cations arent commutative, i.e. left X right wont give you the
;// same result as right X left.
;//
;// Note: ACC register is modified
;//-----
  .macro MatrixMultiply matresult, matleft, matright
        acc,
  m111
                           \matright[0], \matleft[0][x]
  madd
             acc,
                           \matright[1], \matleft[0][y]
             acc,
  madd
                           \matright[2], \matleft[0][z]
  madd
              \matresult[0], \matright[3], \matleft[0][w]
            acc,
acc,
  mul
                           \matright[0], \matleft[1][x]
  madd
                           \matright[1], \matleft[1][y]
                           \matright[2], \matleft[1][z]
  madd
              acc,
  madd
            \matresult[1], \matright[3], \matleft[1][w]
            acc,
                            \matright[0], \matleft[2][x]
  mul
             acc,
                            \matright[1], \matleft[2][y]
  madd
  madd
              acc,
                            \matright[2], \matleft[2][z]
  madd
              \matresult[2], \matright[3], \matleft[2][w]
                           \matright[0], \matleft[3][x]
  mul
            acc,
                           \matright[1], \matleft[3][y]
  madd
             acc,
                            \matright[2], \matleft[3][z]
  madd
             acc,
  madd
              \matresult[3], \matright[3], \matleft[3][w]
  .endm
://-----
;// LocalizeLightMatrix - Transform the light matrix "lightmatrix" into
;// local space, as described by "matrix", and output the result in
;// "locallightmatrix"
;//
;// Note: ACC register is modified
;//-----
  .macro LocalizeLightMatrix locallightmatrix, matrix, lightmatrix
                                 \lightmatrix[0], \matrix[0][x]
  mu l
             acc,
  madd
              acc,
                                 \lightmatrix[1], \matrix[0][y]
  madd
                                 \lightmatrix[2], \matrix[0][z]
              acc,
  madd
              \locallightmatrix[0], \lightmatrix[3], \matrix[0][w]
  mul
              acc,
                                  \lightmatrix[0], \matrix[1][x]
                                  \left(1\right), \max[1][y]
  madd
               acc,
  madd
               acc,
                                  \lightmatrix[2], \matrix[1][z]
  madd
               \locallightmatrix[1], \lightmatrix[3], \matrix[1][w]
                                  \lightmatrix[0], \matrix[2][x]
  mul
             acc,
                                  \lightmatrix[1], \matrix[2][y]
  madd
             acc,
                                  \lightmatrix[2], \matrix[2][z]
  madd
              acc,
  madd
               \locallightmatrix[2], \lightmatrix[3], \matrix[2][w]
               \locallightmatrix[3], \lightmatrix[3]
  move
  .endm
```

```
;//-----
;// MatrixMultiplyVertex - Multiply "matrix" by "vertex", and output
;// the result in "vertexresult"
;// Note: Apply rotation, scale and translation
;// Note: ACC register is modified
  .macro MatrixMultiplyVertex vertexresult, matrix, vertex
        acc,
                       \matrix[0], \vertex[x]
         acc,
  madd
                       \matrix[1], \vertex[y]
           acc,
  madd
                       \matrix[2], \vertex[z]
  madd
            \vertexresult, \matrix[3], \vertex[w]
  .endm
;//----
;// MatrixMultiplyVertex - Multiply "matrix" by "vertex", and output
;// the result in "vertexresult"
;//
;// Note: Apply rotation, scale and translation
;// Note: ACC register is modified
;//----
  .macro MatrixMultiplyVertexXYZ1 vertexresult,matrix,vertex
                      \matrix[0], \vertex[x]
  mul
       acc,
                       \matrix[1], \vertex[y]
 madd
           acc,
           acc,
                       \matrix[2], \vertex[z]
  madd
            \vertexresult, \matrix[3], vf00[w]
  madd
  .endm
;//-----
;// MatrixMultiplyVector - Multiply "matrix" by "vector", and output
;// the result in "vectorresult"
;//
;// Note: Apply rotation and scale, but no translation
;// Note: ACC register is modified
  .macro MatrixMultiplyVector vectorresult, matrix, vector
  mul acc, \matrix[0], \vector[x]
                       \matrix[1], \vector[y]
  madd
           acc,
            \vectorresult, \matrix[2], \vector[z]
  madd
  .endm
://-----
;// VectorLoad - Load "vector" from VU mem location "vumemlocation" +
;// "offset"
;//-----
  .macro VectorLoad vector, offset, vumemlocation
  lq
            \vector, \offset(\vumemlocation)
  .endm
;//-----
;// VectorSave - Save "vector" to VU mem location "vumemlocation" +
;// "offset"
;//-----
  .macro VectorSave vector,offset,vumemlocation
       \vector, \offset(\vumemlocation)
  .endm
```

```
;//-----
;// VectorAdd - Add 2 vectors, "vector1" and "vector2" and output the
;// result in "vectorresult"
;//----
 .macro VectorAdd vectorresult, vector1, vector2
 add
        \vectorresult, \vector1, \vector2
  .endm
;//----
;// VectorSub - Subtract "vector2" from "vector1", and output the
;// result in "vectorresult"
://-----
  .macro VectorSub vectorresult, vector1, vector2
 sub
          \vectorresult, \vector1, \vector2
 .endm
://-----
;// VertexLoad - Load "vertex" from VU mem location "vumemlocation" +
;//-----
  .macro VertexLoad vertex, offset, vumemlocation
 lq
           \vertex, \offset(\vumemlocation)
 .endm
://-----
;// VertexSave - Save "vertex" to VU mem location "vumemlocation" +
;// "offset"
;//-----
  .macro VertexSave vertex, offset, vumemlocation
      \vertex, \offset(\vumemlocation)
  .endm
;//----
;// VertexPersCorr - Apply perspective correction onto "vertex" and
;// output the result in "vertexoutput"
;//
;// Note: Q register is modified
;//-----
 .macro VertexPersCorr vertexoutput, vertex
           q, vf00[w], \vertex[w]
 mul
           \vertexoutput, \vertex, q
 .endm
;//----
;// VertexPersCorrST - Apply perspective correction onto "vertex" and
;// "st", and output the result in "vertexoutput" and "stoutput"
;//
;// Note: Q register is modified
;//-----
 .macro VertexPersCorrST vertexoutput, stoutput, vertex, st
       q, vf00[w], \vertex[w]
          \vertexoutput, \vertex, q
 mul.xyz
move.w
          \vertexoutput, \vertex
 move.w
 mul
           \stoutput, \st,
 .endm
```

```
;//-----
;// VertexFPtoGsXYZ2 - Convert an XYZW, floating-point vertex to GS
;// XYZ2 format (ADC bit isnt set)
;//----
  .macro VertexFpToGsXYZ2 outputxyz,vertex
 ftoi4.xy \outputxyz, \vertex
ftoi0.z \outputxyz. \vertex
 ftoi0.z
           \outputxyz, \vertex
           \outputxyz, vi00
 mfir.w
  .endm
;//-----
;// VertexFPtoGsXYZ2Adc - Convert an XYZW, floating-point vertex to GS
;// XYZ2 format (ADC bit is set)
;//-----
  .macro VertexFpToGsXYZ2Adc outputxyz,vertex
 ftoi4.xy \outputxyz, \vertex
 ftoi0.z
          \outputxyz, \vertex
 ftoi15.w
          \outputxyz, vf00
  .endm
;//-----
;// VertexFpToGsXYZF2 - Convert an XYZF, floating-point vertex to GS
;// XYZF2 format (ADC bit isnt set)
://-----
  .macro VertexFpToGsXYZF2 outputxyz,vertex
 ftoi4 \outputxyz, \vertex
 .endm
://-----
;// VertexFpToGsXYZF2Adc - Convert an XYZF, floating-point vertex to GS
;// XYZF2 format (ADC bit is set)
;//----
  .macro VertexFpToGsXYZF2Adc outputxyz,vertex
          \outputxyz, \vertex
 mtir
           vclsmlitemp, \outputxyz[w]
 iaddiu
          vclsmlitemp, 0x7FFF
 iaddi
           vclsmlitemp, 1
 mfir.w
           \outputxyz, vclsmlitemp
 .endm
://-----
;// ColorFPtoGsRGBAQ - Convert an RGBA, floating-point color to GS
;// RGBAQ format
;//----
  .macro ColorFPtoGsRGBAQ outputrgba,color
 ftoi0
          \outputrgba, \color
  .endm
;//-----
;// ColorGsRGBAQtoFP - Convert an RGBA, GS RGBAQ format to floating-
;// point color
;//-----
  .macro ColorGsRGBAQtoFP outputrgba, color
 itof0
      \outputrgba, \color
  .endm
```

```
;//-----
;// CreateGsPRIM - Create a GS-packed-format PRIM command, according to
;// a specified immediate value "prim"
;// Note: Meant more for debugging purposes than for a final solution
;//-----
  .macro CreateGsPRIM outputprim,prim
  iaddiu
            vclsmlitemp, vi00, \prim
  mfir
             \outputprim, vclsmlitemp
  .endm
;//-----
;// CreateGsRGBA - Create a GS-packed-format RGBA command, according to
;// specified immediate values "r", "g", "b" and "a" (integer 0-255)
;// Note: Meant more for debugging purposes than for a final solution
  .macro CreateGsRGBA outputrgba, r, g, b, a
  iaddiu vclsmlitemp, vi00, \r
  mfir.x
             \outputrgba, vclsmlitemp
  iaddiu
            vclsmlitemp, vi00, \g
  mfir.y
             \outputrgba, vclsmlitemp
  iaddiu
             vclsmlitemp, vi00, \b
  mfir.z
             \outputrgba, vclsmlitemp
  iaddiu
            vclsmlitemp, vi00, \a
  mfir.w
             \outputrgba, vclsmlitemp
  .endm
;//-----
;// CreateGsSTQ - Create a GS-packed-format STQ command, according to
;// specified immediate values "s", "t" and "q" (floats)
;//
;// Note: I register is modified
;// Note: Meant more for debugging purposes than for a final solution
  .macro CreateGsSTQ outputstq,s,t,q
  loi
             \s
  add.x
             \outputstq, vf00, i
             \t
        \outputstq, vf00, i
  add.y
             /q
  loi
  add.z
            \outputstq, vf00, i
  .endm
;//-----
;// CreateGsUV - Create a GS-packed-format VU command, according to
;// specified immediate values "u" and "v" (integer -32768 - 32768,
;// with 4 LSB as precision)
;//
;// Note: Meant more for debugging purposes than for a final solution
;//----
  .macro CreateGsUV outputuv,u,v
         vclsmlitemp, vi00, \u
  iaddiu
            \outputuv, vclsmlitemp
  mfir.x
  iaddiu
            vclsmlitemp, vi00, \v
  mfir.y
             \outputuv, vclsmlitemp
  .endm
```

```
;//-----
;// CreateGsRGBA - Create a GS-packed-format RGBA command, according to
;// a specified immediate value "fog" (integer 0-255)
;// Note: Meant more for debugging purposes than for a final solution
;//----
  .macro CreateGsFOG outputfog, fog
  iaddiu vclsmlitemp, vi00, \fog * 16
  mfir.w
            \outputfog, vclsmlitemp
  .endm
://-----
;// VectorDotProduct - Calculate the dot product of "vector1" and
;// "vector2", and output to "dotproduct"[x]
;//----
  .macro VectorDotProduct dotproduct, vector1, vector2
  mul.xyz
            \dotproduct, \vector1, \vector2
            \dotproduct, \dotproduct[y]
  add.x
            \dotproduct, \dotproduct[z]
  add.x
  .endm
;//----
;// VectorDotProductACC - Calculate the dot product of "vector1" and
;// "vector2", and output to "dotproduct"[x]. This one does it using
;// the ACC register which, depending on the case, might turn out to be
;// faster or slower.
;// Note: ACC register is modified
;//----
  .macro VectorDotProductACC dotproduct,vector1,vector2
            Vector1111, vf00, vf00[w]
vclsmlftemp, \vector1, \vector2
  max
  mul
            acc, vclsmlftemp, vclsmlftemp[y]
  add.x
  madd.x
            \dotproduct, Vector1111, vclsmlftemp
  .endm
://-----
;// \mbox{VectorCrossProduct} - Calculate the cross product of "vector1" and
;// "vector2", and output to "vectoroutput"
;// Note: ACC register is modified
://-----
  .macro VectorCrossProduct vectoroutput, vector1, vector2
  opmula.xyz ACC, \vector1, \vector2
opmsub.xyz \vectoroutput, \vector2, \vector1
sub.w \vectoroutput, \vfn0, \vfn0
  sub.w
            \vectoroutput, vf00, vf00
  .endm
```

```
;//-----
;// VectorNormalize - Bring the length of "vector" to 1.f, and output
;// it to "vectoroutput"
;// Note: Q register is modified
;//-----
  .macro VectorNormalize vecoutput, vector
  mul.xyz vclsmlftemp, \vector, \vector
add.x vclsmlftemp, vclsmlftemp, vclsmlftemp[y]
add.x vclsmlftemp, vclsmlftemp[z]
rsqrt q, vf00[w], vclsmlftemp[x]
sub.w \vecoutput, vf00, vf00
mul.xyz \vecoutput, \vector, q
  .endm
://----
;// VectorNormalizeXYZ - Bring the length of "vector" to 1.f, and out-
;// put it to "vectoroutput". The "w" field isn't transfered.
;//
;// Note: Q register is modified
://-----
   .macro VectorNormalizeXYZ vecoutput, vector
  mul.xyz vclsmlftemp, \vector, \vector
  add.x vclsmlftemp, vclsmlftemp, vclsmlftemp[y] add.x vclsmlftemp, vclsmlftemp, vclsmlftemp[z] rsqrt q, vf00[w], vclsmlftemp[x] mul.xyz \vecoutput, \vector, q
  .endm
;//----
;// VertexLightAmb - Apply ambient lighting "ambientrgba" to a vertex
;// of color "vertexrgba", and output the result in "outputrgba"
;//-----
  .macro VertexLightAmb rgbaout, vertexrgba, ambientrgba
        \rgbaout, \vertexrgba, \ambientrgba
  m111
  .endm
://----
;// VertexLightDir3 - Apply up to 3 directional lights contained in a
;// light matrix "lightmatrix" to a vertex of color "vertexrgba" and
;// having a normal "vertexnormal", and output the result in
;// "outputrgba"
;//
;// Note: ACC register is modified
;//----
  .macro VertexLightDir3
rgbaout, vertexrgba, vertexnormal, lightcolors, lightnormals
        acc, \lightnormals[0], \vertexnormal[x]
acc, \lightnormals[1], \vertexnormal[y]
acc, \lightnormals[2], \vertexnormal[z]
  mul
  madd
  madd
  madd \rgbaout, \lightnormals[3], \vertexnormal[w];// Here
"rgbaout" is the dot product for the 3 lights
  max \rgbaout, \rgbaout, vf00[x]
                                                      ;// Here
"rgbaout" is the dot product for the 3 lights
  mul
        acc, \lightcolors[0], \rgbaout[x]
                     \lightcolors[1], \rgbaout[y]
  madd
              acc,
              \rgbaout, \lightcolors[2], \rgbaout[z] ;// Here
  madd
"rgbaout" is the light applied on the vertex mul \rgbaout, \vertexrgba, \rgbaout ;// Here
"rgbaout" is the amount of light reflected by the vertex
```

```
;//-----
;// VertexLightDir3Amb - Apply up to 3 directional lights, plus an
;// ambient light contained in a light matrix "lightmatrix" to a vertex
;// of color "vertexrgba" and having a normal "vertexnormal", and
;// output the result in "outputrgba"
;//
;// Note: ACC register is modified
;//----
  .macro VertexLightDir3Amb
rgbaout, vertexrgba, vertexnormal, lightcolors, lightnormals
  mul          acc,          \lightnormals[0], \vertexnormal[x]
madd          acc,          \lightnormals[1], \vertexnormal[y]
madd          acc,          \lightnormals[2], \vertexnormal[z]
madd          \rgbaout, \lightnormals[3], \vertexnormal[w];// Here
"rgbaout" is the dot product for the 3 lights
  max \rgbaout, \rgbaout, vf00[x]
                                                       ;// Here
"rgbaout" is the dot product for the 3 lights
  mul acc, \lightcolors[0], \rgbaout[x]
              acc, \lightcolors[1], \rgbaout[y]
acc, \lightcolors[2], \rgbaout[z]
  madd
  madd
        \rgbaout, \lightcolors[3], \rgbaout[w] ;// Here
  madd
"rgbaout" is the light applied on the vertex
  mul.xyz \rgbaout, \vertexrgba, \rgbaout
                                                       ;// Here
"rgbaout" is the amount of light reflected by the vertex
  .endm
://----
;// FogSetup - Set up fog "fogparams", by specifying "nearfog" and
;// "farfog". "fogparams" will afterward be ready to be used by fog-
;// related macros, like "VertexFogLinear" for example.
;//
;// Note: I register is modified
;//-----
  .macro FogSetup fogparams, nearfogz, farfogz
         \fogparams, vf00, vf00
                                                  ;// Set XYZW to
                                                     ;//
  loi
                \farfogz
  add.w
               \fogparams, \fogparams, i
                                                     ;// fogparam[w]
is farfogz
  loi
               \nearfogz
             \fogparams, \fogparams, \fogparams[w]
  add.z
  sub.z
              \fogparams, \fogparams, i
              255.0
  loi
  add.xy
               \fogparams, \fogparams, i
                                                     ;// fogparam[y]
is 255.0
              \fogparams, \fogparams, vf00[w]
                                                     ;// fogparam[x]
  sub.x
is 254.0
                          \fogparams[y], \fogparams[z]
  div
               \fogparams, \fogparams, \fogparams
  sub.z
               \fogparams, \fogparams, q;// fogparam[z] is 255.f /
  add.z
(farfogz - nearfogz)
  .endm
```

```
;//-----
;// \ensuremath{\text{VertexFogLinear}} - Apply fog "fogparams" to a \ensuremath{\text{vertex}} "xyzw", and
;// output the result in "xyzfoutput". "xyzw" [w] is assumed to be
;// the distance from the camera. "fogparams" must contain farfogz in
;// [w], and (255.f / (farfogz - nearfogz)) in [z]. "xyzfoutputf" [w]
;// will contain a float value between 0.0 and 255.0, inclusively.
://----
  .macro VertexFogLinear xyzfoutput,xyzw,fogparams
           \xyzfoutput, \xyzw
                                          ;// XYZ part won't
  move.xyz
be modified
  sub.w
             \xyzfoutput, \fogparams, \xyzw[w]
                                          ;// fog = (farfogz -
z) * 255.0 /
 mul.w
             \xyzfoutput, \xyzfoutput, \fogparams[z];//
nearfogz)
             \xyzfoutput, \xyzfoutput, vf00[x] ;// Clamp fog values
 max.w
outside the
             \xyzfoutput, \xyzfoutput, \fogparams[y];// range 0.0-255.0
 mini.w
  .endm
;//-----
;// VertexFogRemove - Remove any effect of fog to "xyzf". "fogparams"
;// [x] must be set to 254.0. "xyzf" will be modified directly.
  .macro VertexFogRemove xyzf,fogparams
            \xyzf, \xyzw[w] = 1.0 + 254.0 =
255.0 = no fog
  .endm
;//-----
;// PushInteger1 - Push "integer1" on "stackptr"
;// Note: "stackptr" is updated
;//----
  .macro PushInteger1 stackptr,integer1
  isubiu \stackptr, \stackptr, 1
  iswr.x
             \integer1, (\stackptr):VCLSML STACK
  .endm
;//-----
;// PushInteger2 - Push "integer1" and "integer2" on "stackptr"
;// Note: "stackptr" is updated
;//-----
  .macro PushInteger2 stackptr,integer1,integer2
  isubiu \stackptr, \stackptr, 1
             \integer1, (\stackptr): VCLSML STACK
  iswr.x
            \integer2, (\stackptr): VCLSML STACK
  iswr.y
  .endm
;//-----
;// PushInteger3 - Push "integer1", "integer2" and "integer3" on
;// "stackptr"
;//
;// Note: "stackptr" is updated
  .macro PushInteger3 stackptr,integer1,integer2,integer3
            \stackptr, \stackptr, 1
  isubiu
             \integer1, (\stackptr):VCLSML STACK
  iswr.x
            \integer2, (\stackptr): VCLSML STACK
            \integer3, (\stackptr):VCLSML STACK
  iswr.z
  .endm
```

```
;//----
;// PushInteger4 - Push "integer1", "integer2", "integer3" and
;// "integer4" on "stackptr"
;//
;// Note: "stackptr" is updated
;//-----
  .macro PushInteger4 stackptr,integer1,integer2,integer3,integer4
  isubiu \stackptr, \stackptr, 1
  iswr.x
            \integer1, (\stackptr):VCLSML STACK
  iswr.y
            \integer2, (\stackptr):VCLSML_STACK
  iswr.z
            \integer3, (\stackptr):VCLSML STACK
            \integer4, (\stackptr):VCLSML STACK
  iswr.w
  .endm
;//-----
;// PopInteger1 - Pop "integer1" on "stackptr"
;//
;// Note: "stackptr" is updated
;//----
  .macro PopInteger1 stackptr,integer1
  ilwr.x \integer1, (\stackptr):VCLSML_STACK
iaddin \stackptr \stackptr 1
  iaddiu
            \stackptr, \stackptr, 1
  .endm
;//----
;// PopInteger2 - Pop "integer1" and "integer2" on "stackptr"
;//
;// Note: "stackptr" is updated
;//----
  .macro PopInteger2 stackptr,integer1,integer2
  ilwr.y \integer2, (\stackptr):VCLSML_STACK
ilwr.x \integer1 (\stackptr):VCLSML_STACK
  ilwr.x
iaddiu
            \integer1, (\stackptr):VCLSML STACK
           \stackptr, \stackptr, 1
  .endm
://-----
;// PopInteger3 - Pop "integer1", "integer2" and "integer3" on
;// "stackptr"
;//
;// Note: "stackptr" is updated
://-----
  .macro PopInteger3 stackptr,integer1,integer2,integer3
  ilwr.z \integer3, (\stackptr):VCLSML_STACK
  ilwr.y
            \integer2, (\stackptr):VCLSML STACK
  ilwr.x
            \integer1, (\stackptr):VCLSML_STACK
  iaddiu
            \stackptr, \stackptr, 1
  .endm
```

```
;//-----
;// PopInteger4 - Pop "integer1", "integer2", "integer3" and
;// "integer4" on "stackptr"
;// Note: "stackptr" is updated
;//-----
  .macro PopInteger4 stackptr,integer1,integer2,integer3,integer4
         \integer4, (\stackptr):VCLSML STACK
  ilwr.w
  ilwr.z
            \integer3, (\stackptr):VCLSML STACK
  ilwr.y
            \integer2, (\stackptr):VCLSML STACK
  ilwr.x
            \integer1, (\stackptr):VCLSML STACK
  iaddiu
            \stackptr, \stackptr, 1
  .endm
;//-----
;// PushMatrix - Push "matrix" onto the "stackptr"
;//
;// Note: "stackptr" is updated
  .macro PushMatrix stackptr, matrix
            \matrix[0], -1(\stackptr):VCLSML_STACK
            \matrix[1], -2(\stackptr):VCLSML_STACK
  sa
  sq
            \matrix[2], -3(\stackptr):VCLSML_STACK
            \matrix[3], -4(\stackptr):VCLSML STACK
  sq
  iaddi
            \stackptr, \stackptr, -4
  .endm
;//-----
;// PopMatrix - Pop "matrix" out of the "stackptr"
;//
;// Note: "stackptr" is updated
;//----
  .macro PopMatrix stackptr, matrix
            \matrix[0], 0(\stackptr):VCLSML STACK
            \matrix[1], 1(\stackptr):VCLSML_STACK
  lq
            \matrix[2], 2(\stackptr):VCLSML STACK
  lq
            \matrix[3], 3(\stackptr):VCLSML_STACK
  lq
            \stackptr, \stackptr, 4
  iaddi
  .endm
://-----
;// PushVector - Push "vector" onto the "stackptr"
;//
;// Note: "stackptr" is updated
;//-----
  .macro PushVector stackptr, vector
  sqd \vector, (--\stackptr):VCLSML STACK
  .endm
;//-----
;// PopVector - Pop "vector" out of the "stackptr"
;//
;// Note: "stackptr" is updated
;//-----
  .macro PopVector stackptr, vector
       \vector, (\stackptr++):VCLSML_STACK
  lqi
  .endm
```

```
://-----
;// PushVertex - Push "vector" onto the "stackptr"
;// Note: "stackptr" is updated
;//-----
  .macro PushVertex stackptr, vertex
        \vertex, (--\stackptr):VCLSML STACK
  sad
  .endm
://-----
;// PopVertex - Pop "vertex" out of the "stackptr"
;// Note: "stackptr" is updated
;//----
  .macro PopVertex stackptr, vertex
             \vertex, (\stackptr++):VCLSML STACK
  .endm
;//----
;// AngleSinCos - Returns the sin and cos of up to 2 angles, which must
;// be contained in the X and Z elements of "angle". The sin/cos pair
;// will be contained in the X/Y elements of "sincos" for the first
;// angle, and {\ensuremath{{\mbox{Z}}}/{\ensuremath{{\mbox{W}}}}} for the second one.
;// Thanks to Colin Hugues (SCEE) for that one
;// Note: ACC and I registers are modified, and a bunch of temporary
;// variables are created... Maybe bad for VCL register pressure
;//-----
  .macro AngleSinCos angle, sincos
  move.xz \sincos, \angle
                                    ; To avoid modifying the
original angles...
  mul.w \sincos, vf00, \sincos[z] ; Copy angle from z to w
add.y \sincos, vf00, \sincos[x] ; Copy angle from x to y
  loi
             1.570796
                                    ; Phase difference for sin as
cos ( PI/2 )
            \sincos, \sincos, I
 sub.xz
                                    ;
             \sincos, \sincos
 abs
                                              ; Mirror cos
around zero
            Vector1111, vf00, vf00[w]
 max
                                              ; Initialise all
1s
            -0.159155 ; Scale so single cycle is range 0 to -1 (
  loi
*-1/2PI )
            ACC, \sincos, I
 mul
            12582912.0
  loi
                                    ; Apply bias to remove
fractional part
 msub ACC, Vector1111, I ; Remove bias to leave
original int part
             -0.159155 ; Apply original number to leave
fraction range only
 msub ACC, \sincos, I
          0.5
                                               ; Ajust range: -
 loi
0.5 to +0.5
```

```
msub \sincos, Vector1111, I
  abs
              \sincos, \sincos
                                                   ; Clamp: 0 to
+0.5
  loi
             0.25
                                                   ; Ajust range: -
0.25 to +0.25
             \sincos, \sincos, I
  sub
              anglepower2, \sincos, \sincos
                                                  ; a^2
  mu l
               -76.574959
  loi
               k4angle, \sincos, I
                                                   ; k4 a
  mul
  loi
               -41.341675
                                                  ; k2 a
  mul
              k2angle, \sincos, I
  loi
             81.602226
  mul
              k3angle, \sincos, I
                                                   ; k3 a
            anglepower4, anglepower2, anglepower2 ; a^4 k4angle, k4angle, anglepower2 ; k4 a^3 ACC, k2angle, anglepower2 ; + k2 a
  mul
  mul
                                                   ; + k2 a^3
  mul
              39.710659
  loi
                                                   ; k5 a
  mul
               k2angle, \sincos, I
             anglepower8, anglepower4, anglepower4 ; a^8
  mul
             ACC, k4angle, anglepower4
                                                  ; + k4 a^7
  madd
  madd
             ACC, k3angle, anglepower4
                                                  ; + k3 a^5
             6.283185
  loi
             ACC, \sincos, I
  madd
                                                   ; + k1 a
              \sincos, k2angle, anglepower8
  madd
                                                   ; + k5 a^9
  .endm
;//-----
;// QuaternionToMatrix - Converts a quaternion rotation to a matrix
;// Thanks to Colin Hugues (SCEE) for that one
;// Note: ACC and I registers are modified
  .macro QuaternionToMatrix matresult, quaternion
  mula.xyz ACC, \quaternion, \quaternion ; xx yy zz
 loi 1.414213562
muli vclsmlftemp, \quaternion, I ; x sqrt2 y sqrt2 z
sqrt2 w sqrt2
           \matresult[0], vf00
                                               ; Set rhs matrix
  mr32.w
line 0 to 0
            \matresult[1], vf00
 mr32.w
               \matresult[2], vf00
  mr32.w
                                                ; Set rhs matrix
               \matresult[3], vf00
  move
                                                ; Set bottom line to
0 0 0 1
  madd.xyz vcl_2qq, \quaternion, \quaternion ; 2xx 2yy
2zz
 addw.xyz Vector111, vf00, vf00
                                       ; 1
```

```
opmula.xyz ACC, vclsmlftemp, vclsmlftemp
                                                ; 2yz
                                                                2xz
2xy -
  msubw.xyz vclsmlftemp2, vclsmlftemp, vclsmlftemp; 2yz-2xw 2xz-2yz
2xy-2zw -
  maddw.xyz vclsmlftemp3, vclsmlftemp, vclsmlftemp; 2yz+2xw 2xz+2yz
2xy+2zw -
  addaw.xyz ACC, vf00, vf00
                                                     ; 1
                                                                  1
  msubax.yz ACC, Vector111, vcl_2qq
                                                      ; 1
                                                                  1-2xx
1-2xx
                 \matresult[2], Vector111, vcl 2qq
  msuby.z
1-2xx-2yy -
  msubay.x
                 ACC, Vector111, vcl 2qq
                                                      ; 1-2yy
                                                                 1-2xx
1-2xx-2yy -
  msubz.y
                 \matresult[1], Vector111, vcl 2qq
                                                                 1-2xx-
  mr32.y
2zz -
                 \matresult[0], vclsmlftemp2
                 \matresult[0], Vector111, vcl 2qq ; 1-2yy-2zz -
  msubz.x
              \matresult[2], vclsmlftemp2
\matresult[0], vf00, vclsmlftemp3
vclsmlftemp, vclsmlftemp2
\matresult[1], vclsmlftemp
\matresult[2], vf00, vclsmlftemp3
vclsmlftemp3, vclsmlftemp3
  mr32.x
  addy.z
  mr32.w
  mr32.z
  addx.y
  mr32.y
mr32.x
                 \matresult[1], vclsmlftemp3
  .endm
;//-----
;// QuaternionMultiply - Multiplies "quaternion1" and "quaternion2",
;// and puts the result in "quatresult".
;// Thanks to Colin Hugues (SCEE) for that one
;//
;// Note: ACC register is modified
://-----
  .macro QuaternionMultiply quatresult,quaternion1,quaternion2
                 vclsmlftemp, \quaternion1, \quaternion2 ; xx yy zz ww
  mul
                             \quaternion1, \quaternion2 ; Start
  opmula.xyz ACC,
Outerproduct
  madd.xyz ACC, madd.xyz ACC,
                             \quaternion1, \quaternion2[w]; Add w2.xyz1
                              \quaternion2, \quaternion1[w]; Add w1.xyz2
  opmsub.xyz
                \quatresult, \quaternion2, \quaternion1 ; Finish
Outerproduct
              ACC, vclsmlftemp, vclsmlftemp[z]; ww - zz
ACC, vf00, vclsmlftemp[y]; ww - zz - yy
\quatresult, vf00, vclsmlftemp[x]; ww - zz - yy -
  sub.w
  msub.w
  msub.w
хx
   .endm
```

# Appendix B: Detailed Information Regarding Loops in VCL

## Pipelining and VCL

When the "--LoopCS" directive is used, VCL will attempt to unroll and pipeline back to the start any block of code that ends with a conditional branch.

VCL analyzes the sequence of instructions and determines what the best size for the loop would be. A simplistic calculation for the loop size is:

```
best size = max (number upper instructions, number lower instructions,
sum_throughput_p, sum_throughput q)
```

The actual calculation is actually more involved, particularly due to issues like IALU instructions placing (with respect to the branch) and the possibility of circular dependency chains in the instruction sequence.

Note: VCL currently relies on the user to identify memory store/load sequences. It does so by the use of tags. (Refer to "Memory Aliasing and Instructions Reordering" for more details.)

For a typical renderer, the usual circular chain is of the form:

```
IADDIU
                     ptr, ptr, sizeof
        IBNE
                    ptr, end ptr, loop
Or:
        IADDI
                     count, count, -1
        IBNE
                     count, vi00, loop
```

These cases aren't long compared to the size of the loop. (The rest of a typical renderer pulls data out of memory, processes it, then writes it back to memory. However, it does not cross a loop iteration boundary.)

For other types of code loops (such as physics dynamics calculations on strings), where the output from one loop iteration is an input to the next, and the length of the calculation is the same as the overall length of the loop, pipelining will not greatly improve performances without a modification of the algorithm. Such a modification could be to process multiple strings at once.

After analyzing the code, VCL tries to schedule it so that it fits into a block that is of size greater or equal to 0, but smaller than the loop size. It does so by wrapping instructions off the end and back to the start. The number of times an instruction is wrapped around the loop will determine the stage number to which it belongs.

In short:

- In linear mode (non-looped), VCL schedules in Z instructions.
- In loop mode, it schedules in Z modulo n, where n equals best\_size + current optimization phase.
- Software pipelining where the branch is at the end of the critical path will not be optimized greatly by VCL.

Study the following example, which will be referred to later in this appendix:

```
vrt loop:
     --LoopCS
                10,10
     LOI
                 vrt, (in p++)
                                          : Load vertex
                 acc, mat[0], vrt[x]
     MUL
                                          ; Transform vertex
                 acc, mat[1], vrt[y]
     MADD
                                          ;
```

```
MADD acc, mat[2], vrt[z]
MADD
        camv, mat[3], vf00[w]
DIV
         q, vf00[w], camv[w] ; Perspective correction
MUL
         screenv, camv, q
         fixpt, screenv
FTOI4
                             ; Convert to GS format
SQI
         fixpt, (out p++)
                            ; Save the vertex
         IADDI
IBNE
```

The following is a typical sequence of instructions for a loop, where the numerals 1 to 9 denote blocks of instructions that are grouped by pipeline stages. Execution in the order 1, 2, 3, 4, 5, 6, 7, 8, 9 is equivalent to one iteration of the loop.

Figure 1



"c" denotes the stage for the conditional.. In the example above, it is highly likely that the conditional will be at stage 1, since it is not dependent on much else in the code. "m" is the last stage. In a typical renderer, this would often correspond to the color calculations.

As pipeline execute happens:

Figure 2

| Prologue | 1  |    |    |    |   |   |   |   |   | P1 starts                                     |
|----------|----|----|----|----|---|---|---|---|---|-----------------------------------------------|
|          | 2  | 1  |    |    |   |   |   |   |   | P2 starts, P1 is at stage 2                   |
|          | 3  | 2  | 1  |    |   |   |   |   |   | P3 starts, P2 is at stage 2, P3 is at stage 3 |
|          | 4  | 3  | 2  | 1  |   |   |   |   |   | Etc                                           |
|          | 5  | 4  | 3  | 2  | 1 |   |   |   |   |                                               |
|          | 6c | 5  | 4  | 3  | 2 | 1 |   |   |   |                                               |
|          | 7  | 6c | 5  | 4  | 3 | 2 | 1 |   |   |                                               |
|          | 8  | 7  | 6c | 5  | 4 | 3 | 2 | 1 |   |                                               |
| MainLoop | 9  | 8  | 7  | 6c | 5 | 4 | 3 | 2 | 1 | Main body of the loop                         |
| Epilogue |    | 9  | 8  | 7  |   |   |   |   |   |                                               |
|          |    |    | 9  | 8  |   |   |   |   |   |                                               |
|          |    |    |    | 9  |   |   |   |   |   |                                               |

If the conditional is at stage C (here, 6), then stages 1 to C-1 will miss-execute, i.e. in the main loop, the pipeline runs C-1 stages ahead of the conditional. But when the condition is found to be true, VCL will only complete the processing for valid stages which are in the graph above 9, 8-9, and 7-8-9 (epilogue part).

# "--LoopCS n,m" Directive

## n (Minimum Number of Loops)

For small loop counts, there are many instructions that are associated with pipeline stages ahead of the conditional. To get better performance for a small count (n greater or equal to 1 but smaller than M-C), once the conditional in the loop is encountered, VCL can jump to a special case to complete the calculations on the required pipeline stages for this count.

Following is a modified version of the above diagram:

Figure 3

|          | 1  |    |    |    |   |   |   |   |   |
|----------|----|----|----|----|---|---|---|---|---|
|          | 2  | 1  |    |    |   |   |   |   |   |
|          | 3  | 2  | 1  |    |   |   |   |   |   |
|          | 4  | 3  | 2  | 1  |   |   |   |   |   |
|          | 5  | 4  | 3  | 2  | 1 |   |   |   |   |
| А        | 6c | 5  | 4  | 3  | 2 | 1 |   |   |   |
| В        | 7  | 6c | 5  | 4  | 3 | 2 | 1 |   |   |
| С        | 8  | 7  | 6c | 5  | 4 | 3 | 2 | 1 |   |
| MainLoop | 9  | 8  | 7  | 6c | 5 | 4 | 3 | 2 | 1 |
|          |    | 9  | 8  | 7  |   |   |   |   |   |
|          |    |    | 9  | 8  |   |   |   |   |   |
|          |    |    |    | 9  |   |   |   |   |   |

If the code has a small minimum count, such as n = 1, it is possible for the code to exit at "A" and go to a special-case epilogue (EPI\_A). The steps that have already taken place are:

Figure 4

|   | 1  |   |   |   |   |   |
|---|----|---|---|---|---|---|
|   | 2  | 1 |   |   |   |   |
|   | 3  | 2 | 1 |   |   |   |
|   | 4  | 3 | 2 | 1 |   |   |
|   | 5  | 4 | 3 | 2 | 1 |   |
| А | 6c | 5 | 4 | 3 | 2 | 1 |

If the condition at stage 6 is found to be true, the following will be executed:

Figure 5

| EPI_A | 7 |  |  |  |
|-------|---|--|--|--|
|       | 8 |  |  |  |
|       | 9 |  |  |  |

Similarly for n = 2, a special epilogue may be created for the following (EPI-B):

Figure 6

| EPI_B | 8 | 7 |  |  |
|-------|---|---|--|--|
|       | 9 | 8 |  |  |
|       |   | 9 |  |  |

In the case above, having nine stages would require the creation of nine special epilogues, which is a lot of code generation. However, if VCL is told –via the "Minimum Number of Loops"- that there will always be, for example, three iterations, then the special case codes EPI\_A and EPI\_B as well as the two conditionals A and B may be removed altogether.

Up to at least version 1.3, VCL doesn't reschedule instructions across conditionals. So conditional removals by ways described above will most certainly result in better code optimization.

### m (Slop Count)

Referring to the tables above, it can be seen that, in the main loop, VCL will execute some stages ahead of the conditional (in the example, stages 1 to 5). If these contain instructions with side effects (like memory stores and XGKick), this could result in data corruption, since by the time the conditional takes place, such instructions would have already executed.

In the following table, "\*" denotes stages containing an instruction with side-effects, and "f" denotes the first instruction containing instructions with side-effects. "c" denotes the conditional stage, and "m" the last stage.

Figure 7

| 1   |
|-----|
| 2f* |
| 3   |
| 4*  |
| 5   |
| 6c  |
| 7*  |
| 8   |
| 9m  |
|     |

If the first side effect is at stage 2, then stage 2 will have miss-executed a maximum count greater or equal to 0, but smaller than c-f (6-2=4). This is acceptable if extra padding is provided at the end of the store buffer. The number of available padding slots is specified with the Slop Count.

If, for example, m=1, then the above case would generate incorrect code, since c-f=4.

However, for the following case, the generated code would be correct, since c-f=1.

Figure 8



If no side-effect stages can be mis-executed, then the Slop Count must be set to 0. Note that this will, however, result in real constraints on VCL code generation.

#### .rawloop / .endrawloop

For regular loops, VCL will analyze the loop and decide how to stage the instructions and reschedule them accordingly. Then compatible prologues and epilogues will be created around the loop's main body.

In some cases, however, you may already know what the stages are like, and simply want VCL to unroll them and create the prologue and epilogue. Use raw loops for this..

The regular loop shown in "Pipelining and VCL" would be similar to the following example, using raw loops:

```
.rawloop
vrt loop:
      --LoopCS 10,10
      2..MADD acc, mat[2], vrt[z] ; Rotate vertex .. 2
    = 1..LQI vrt, (in p++) ; Load vertex, increment pointer
      2..MADD camv, mat[3], vf00[w]; Translate rotated vertex
    = 3..MOVE ocamv, camv
                                 ; Save copy of camera space coordinate
        NOP
    = 5..SQI fixpt, (out p++)
                                      ; Save out GS-format vertex
    4..FTOI4 fixpt, screenv ; Convert screen coordinate to GS-format = 1..IADDI count, count, -1 ; Decrement loop counter
      1..MUL acc, mat[0], vrt[x]; Rotate vertex .. 0
    = NOP
      1..MADD acc, mat[0], vrt[y] ; Rotate vertex .. 1
     = 1..IBNE count, vi00, vrt loop; Reached the end?
      3..MUL screenv, ocamv, q ; Do perspective divide 2..DIV q, vf00[w], camv[w] ; Start perspective divide calculation
.endrawloop
```

The "<n>..." instruction syntax tells VCL in which stage of the loop the instruction belongs, so it can generate proper prologues and epilogues.

In raw mode, "=" may be used as a "line continue" character, as long as it is the first non-white character on a lower-instruction line. This permits better comments placement. If not used, upper and lower instructions must be placed on the same line, much like regular VSM code.

VCL will create the prologue by first taking the instructions with a "1..." suffix, then the instructions with a "2..." suffix, and so on.

Prologue and epilogue instructions are not rescheduled in the case of a raw loop, as they are for regular loop unrolling. Therefore, this may result in sub-optimal code. However, this is necessary for cases where the code needs to run on VUO in parallel with code on the EE, without synchronization.

46 Appendix B: Detailed Information Regarding Loops in VCL