Assembly language

Darren Kulp edited this page Aug 12, 2015 · 36 revisions

Assembler Syntax and Machine Model

Registers

tenyr has sixteen 32-bit two's-complement signed integer registers, named A through P (case is ignored). Two of these registers, A and P, are special, while the others are general purpose (see Instruction Shorthand and Control Flow for descriptions of the special A and P registers).

Instruction Format

Every tenyr instruction can be expressed in one of four algebraic instruction formats (type0, type1, type2, type3) :

Z <- X op Y + I
Z <- X op I + Y
Z <- I op X + Y
Z <- X      + I20

where Z is a register, op is one of the accepted arithmetic operations, + is addition, X and Y are registers, I is any immediate integer value between -2048 and 2047 inclusive, and I20 is any immediate integer value between -524288 and 524287 inclusive (i.e., I and I20 are 12-bit and 20-bit signed two's-complement integer immediates, respectively). In the first three formats, any of the operands on the right-hand side can be left out, along with the operation that precedes it. Examples :

a <- a + a + 0     # type 0
b <- c * d + 3     # type 0
c <- d - e + -2    # type 0
d <- e ^ f         # type 0
e <- f - 2         # type 1
f <- 2 | g         # type 2
g <- -h            # type 0
h <- i >= 0        # type 1
i <- j < k         # type 0
j <- k             # type 0
k <- j + 0x7abcd   # type 3
l <- 0x7abcd       # type 3
m <- a + a + 0x79a # type 0
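
As a rough illustration of how the four formats combine their operands, here is a minimal Python model. The function names and the `to_signed32` helper are made up for this sketch, and wrapping results to 32 bits is an assumption based on the register width, not something stated above.

```python
import operator

MASK = 0xFFFFFFFF

def to_signed32(value):
    """Wrap an integer to a 32-bit two's-complement signed value (assumed behavior)."""
    value &= MASK
    return value - (1 << 32) if value & 0x80000000 else value

def type0(op, x, y, i):
    return to_signed32(op(x, y) + i)   # Z <- X op Y + I

def type1(op, x, i, y):
    return to_signed32(op(x, i) + y)   # Z <- X op I + Y

def type2(op, i, x, y):
    return to_signed32(op(i, x) + y)   # Z <- I op X + Y

def type3(x, i20):
    return to_signed32(x + i20)        # Z <- X + I20

# b <- c * d + 3   with c = 6, d = 7
print(type0(operator.mul, 6, 7, 3))    # 45
# e <- f - 2       with f = 10 (the omitted Y contributes zero)
print(type1(operator.sub, 10, 2, 0))   # 8
# f <- 2 | g       with g = 4
print(type2(operator.or_, 2, 4, 0))    # 6
# k <- j + 0x7abcd with j = 1
print(hex(type3(1, 0x7ABCD)))          # 0x7abce
```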

Operations

Here are the operations that tenyr supports, with their binary encodings. Operations are encoded this way to group hardware-similar operations together, differing in the most significant bit only. In the table below, each row lists a pair of operations whose encodings differ only in that bit ; e.g., & is 0b0001 and - is 0b1100.

Encoding  Operator  Description              Encoding  Operator  Description
0000      |         bitwise OR               1000      |~        bitwise OR with complemented second operand
0001      &         bitwise AND              1001      &~        bitwise AND with complemented second operand
0010      ^         bitwise XOR              1010      ^^        pack (see below)
0011      >>        arithmetic right shift   1011      >>>       logical right shift
0100      +         signed addition          1100      -         signed subtraction
0101      *         signed multiplication    1101      <<        left shift
0110      ==        bitwise equality         1110      @         test bit at position
0111      <         signed less-than         1111      >=        signed greater-than-or-equal

Some of the operations merit explanation. The test operations (<, >=, ==, and @) produce a result that is either 0 (false) or -1 (true). The canonical truth value in tenyr is -1, not 1. This allows us to do clever things with masks, and also explains the existence of the special &~ and |~ operations : when the second operand is a truth value, the bitwise complement works as a Boolean NOT. The operations also support some syntactic sugar ; for example, B <- ~C is accepted by the assembler and transformed into B <- A |~ C or B <- 0 |~ C, depending on the required operation type.
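
To make the mask idea concrete, the following Python sketch models a test operation that yields -1 or 0 and uses it for a branch-free select. The helper names `lt`, `and_not`, and `select` are invented for this illustration ; they are not tenyr or common.th names.

```python
def lt(a, b):
    """Signed less-than yielding tenyr-style truth values: -1 (true) or 0 (false)."""
    return -1 if a < b else 0

def and_not(a, b):
    """Model of the &~ operation: AND with the complemented second operand."""
    return a & ~b

def select(cond, if_true, if_false):
    """Pick one of two values using a -1/0 mask instead of a branch."""
    return (cond & if_true) | and_not(if_false, cond)

print(select(lt(3, 5), 111, 222))   # 111
print(select(lt(5, 3), 111, 222))   # 222
print(and_not(-1, lt(3, 5)))        # 0, i.e. NOT true is false
```

Because true is all ones, ANDing with a truth value either keeps every bit or clears every bit, which is what makes the select work without a comparison against 1.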

The pack operation (represented by ^^) concatenates the 20 least significant bits of the left operand with the 12 least significant bits of the right operand. This operation makes it easier to construct large values in registers using immediates.
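
A small Python model of the pack operation might look like the following. It assumes that the 20 bits taken from the left operand form the upper part of the result and the 12 bits from the right operand form the lower part ; the function name `pack` is chosen only for this sketch.

```python
def pack(left, right):
    """Concatenate the low 20 bits of `left` with the low 12 bits of `right`.

    Assumption: the left operand supplies the upper 20 bits of the 32-bit result.
    """
    return ((left & 0xFFFFF) << 12) | (right & 0xFFF)

print(hex(pack(0xDEADB, 0xEEF)))    # 0xdeadbeef
print(hex(pack(-1, 0)))             # 0xfffff000
```

The result is shown here as an unsigned bit pattern ; in a register it would be interpreted as a signed 32-bit value.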

tenyr works with signed two's-complement numbers ; the only operation (besides bitwise operations, which have no concept of sign) that is explicitly unsigned is >>>, the logical right shift. Whereas >> (arithmetic right shift) fills in shifted bits with the most significant bit of the word, >>> fills in zeros.
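
The difference between the two right shifts can be sketched in Python as follows. The helpers `asr` and `lsr` are illustrative names, and the behavior for shift amounts of 32 or more is not covered by the description above, so this sketch only considers small shift amounts.

```python
MASK = 0xFFFFFFFF

def to_signed32(value):
    """Interpret the low 32 bits of `value` as a two's-complement signed integer."""
    value &= MASK
    return value - (1 << 32) if value & 0x80000000 else value

def asr(value, amount):
    """Arithmetic right shift: vacated bits copy the sign bit."""
    return to_signed32(to_signed32(value) >> amount)

def lsr(value, amount):
    """Logical right shift: vacated bits are filled with zeros."""
    return to_signed32((value & MASK) >> amount)

print(asr(-8, 1))         # -4
print(hex(lsr(-8, 1)))    # 0x7ffffffc
print(asr(8, 1), lsr(8, 1))   # 4 4 (identical for non-negative values)
```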

Memory Operations

A memory operation looks just like a register-register operation, but with one side of the instruction dereferenced, using brackets :

 D  <- [E  *  4 + F] # a load into D
 E  -> [F  << 2]     # a store from E
[F] <-  2            # another kind of store, with an immediate value

One instruction can't have brackets on both sides of an arrow, and an immediate value cannot appear on the left side of an arrow.
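
The three memory forms above can be modeled with a dictionary standing in for memory. This is only a sketch : it assumes word addressing (one 32-bit word per address) and that unwritten words read as 0, neither of which is spelled out above, and the register values are chosen arbitrarily.

```python
memory = {}                         # sparse model of memory, one word per address
regs = {"D": 0, "E": 3, "F": 4}     # arbitrary starting values for illustration

# E -> [F << 2]       a store from E to address F << 2
memory[regs["F"] << 2] = regs["E"]

# [F] <- 2            a store of the immediate 2 to address F
memory[regs["F"]] = 2

# D <- [E * 4 + F]    a load into D from address E * 4 + F
regs["D"] = memory.get(regs["E"] * 4 + regs["F"], 0)

print(regs["D"])      # 3, because E * 4 + F and F << 2 are both 16 here
print(memory)         # {16: 3, 4: 2}
```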

Instruction Shorthand

Although pieces of the right-hand side of an instruction can be left out during assembly, under the covers all the pieces are still there ; the missing parts are filled in with zeros or with references to the special A register, which always contains 0, even if it is written to. Therefore, each instruction in the following pairs is identical to the other one in the pair :

B  <-  3       ; B  <-  A  |  A + 0x00000003
C  <-  D  *  E ; C  <-  D  *  E + 0x00000000
E  <-  1  << B ; E  <-  0x00000001  << B + A

To see the expanded form, invoke the disassembler (tas -d) with the -v option.
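
The behavior of the A register can be pictured with a small Python register file that discards writes to A. The `Registers` class is a hypothetical model for this page, not part of any tenyr tool, and it ignores the special behavior of P.

```python
class Registers:
    """Sixteen registers A through P; A always reads as 0 (P's special role is not modeled)."""

    def __init__(self):
        self._values = {name: 0 for name in "ABCDEFGHIJKLMNOP"}

    def read(self, name):
        name = name.upper()                 # register names are case-insensitive
        return 0 if name == "A" else self._values[name]

    def write(self, name, value):
        name = name.upper()
        if name != "A":                     # writes to A are silently dropped
            self._values[name] = value

regs = Registers()
regs.write("A", 99)
print(regs.read("a"))                       # 0

# B <- 3   behaves like   B <- A | A + 0x00000003
regs.write("B", (regs.read("A") | regs.read("A")) + 0x00000003)
print(regs.read("B"))                       # 3
```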

Control Flow

tenyr has no dedicated control-flow instructions ; flow is controlled by updating the P register, which is the program counter / instruction pointer. Reading from P will produce the address of the currently executing instruction, plus one. Writing to it will cause the next instruction executed to be fetched from the address written into P. For example, if this program starts at address 0 :

B <- P      # after this instruction, B contains 1
D <- 3      # after this instruction, D contains 3
P <- P - 3  # this is a loop back to the first instruction above

Notice that in the third instruction it was necessary to subtract 3 instead of 2, because the value in P was effectively the location of the next instruction that would have been executed in the absence of a control flow change.
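
The following Python sketch traces the three-instruction program above. It is a toy model, not the real simulator : each instruction is a function that receives the value P would read (its own address plus one) and returns a new P only when it writes to P.

```python
def run(program, steps):
    """Execute `steps` instructions starting at address 0 and record the addresses visited."""
    regs = {"B": 0, "D": 0}
    pc = 0
    trace = []
    for _ in range(steps):
        trace.append(pc)
        p_read = pc + 1                      # the value an instruction sees when it reads P
        new_p = program[pc](regs, p_read)    # None means P was not written
        pc = p_read if new_p is None else new_p
    return regs, trace

program = [
    lambda regs, p: regs.update(B=p),        # B <- P      (dict.update returns None)
    lambda regs, p: regs.update(D=3),        # D <- 3
    lambda regs, p: p - 3,                   # P <- P - 3
]

regs, trace = run(program, 6)
print(trace)    # [0, 1, 2, 0, 1, 2]
print(regs)     # {'B': 1, 'D': 3}
```

At address 2 the instruction reads P as 3, so subtracting 3 yields address 0 ; subtracting 2 would have jumped to address 1 instead.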

Under normal circumstances, the programmer is not expected to update the P register in such a direct fashion, but rather to use a macro like jnzrel(reg,target) from common.th :

#include "common.th"
    D <- 5
    C <- 10
loop_top:
    C <- C - 1
    N <- C > D
    jnzrel(N,loop_top)

where loop_top is a label to jump to, and jnzrel means "jump if not zero to relative" (admittedly, this is not a very good name, because N needs to be -1 not merely nonzero).

Notice that we used > even though this is not one of the supported operations. The assembler accepts > and rewrites it into a valid tenyr instruction by swapping the order of the operands and using < instead. An analogous transformation occurs for <=.
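
The operand swap can be checked with a few lines of Python. The functions `lt` and `ge` model the two supported comparisons, while `gt` and `le` are written the way the rewriting is described above ; all four names are invented for this sketch.

```python
def lt(a, b):
    """Supported operation <, returning -1 for true and 0 for false."""
    return -1 if a < b else 0

def ge(a, b):
    """Supported operation >=, returning -1 for true and 0 for false."""
    return -1 if a >= b else 0

def gt(a, b):
    """a > b rewritten as b < a."""
    return lt(b, a)

def le(a, b):
    """a <= b rewritten as b >= a."""
    return ge(b, a)

print(gt(9, 5), gt(5, 5))   # -1 0
print(le(5, 5), le(9, 5))   # -1 0
```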

Special instructions

All 32-bit words decode to a legal instruction of type 0, 1, 2, or 3. The token illegal is accepted by the assembler and encoded as 0xffffffff, which is the type3 instruction P <- [P - 1]. This instruction will update P with the value of the instruction itself, so it has the effect of P <- 0xffffffff. The simulator halts before attempting to execute the instruction at address 0xffffffff.
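
Why P <- [P - 1] loads the instruction word itself can be seen in a short Python model. The function name `execute_illegal` and the address 0x10 are arbitrary choices for illustration.

```python
ILLEGAL = 0xFFFFFFFF

def execute_illegal(address, memory):
    """Model P <- [P - 1] for an instruction located at `address`; returns the new P."""
    p_read = address + 1             # P reads as the current address plus one
    return memory[p_read - 1]        # so [P - 1] is the instruction's own word

memory = {0x10: ILLEGAL}             # the illegal instruction placed at address 0x10
print(hex(execute_illegal(0x10, memory)))   # 0xffffffff
```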

Labels

Labels can be used to identify segments of code and data. A label is defined by a sequence of alphanumeric characters and underscores, where that sequence cannot look like a register name (this restriction may be relaxed in the future). A label is referred to by prefixing @ to its name :

data:
    .word 0xdeadbeef
top:
    B <- C
    D <- @data
    E <- @top

Getting the value of @label directly isn't generally useful, because its value is relative to where the code or data was loaded in memory. If code was loaded in a single section at the default address of 0x1000, one would need to add 0x1000 to @data to get the absolute value in memory. This is easier when using the special label ., which gives the offset from the beginning of the current section to the current address ; then the expression P - (. + 1) will be the loading offset. This is handled by the rel() macro from common.th. rel() produces a "PC-relocated" address from a label reference.
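
The arithmetic behind P - (. + 1) can be worked through in Python. This is only a numeric sketch : `load_offset` and `rel` are hypothetical Python stand-ins and do not show how the real rel() macro is written.

```python
LOAD_BASE = 0x1000                   # default load address mentioned above

def load_offset(dot, load_base=LOAD_BASE):
    """Compute P - (. + 1) for an instruction at section offset `dot`."""
    p = load_base + dot + 1          # P reads as the absolute address plus one
    return p - (dot + 1)

def rel(label_offset, dot, load_base=LOAD_BASE):
    """Turn a section-relative label value into an absolute address."""
    return label_offset + load_offset(dot, load_base)

print(hex(load_offset(5)))           # 0x1000, regardless of where the instruction sits
print(hex(rel(0, 3)))                # 0x1000, the absolute address of a label at offset 0
```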

Immediates

Immediate values in type{0,1,2} instructions are 12 bits wide, thus ranging from -2048 to 2047. In type3 instructions, they are 20 bits wide, thus ranging from -524288 to 524287. ASCII (the ASCII subset of UTF-32, really) character constants can appear in immediate expressions :

B <- '$'
C <- 4
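
The two immediate ranges follow directly from the field widths, as this Python check shows. The helper name `fits_signed` is made up for the example.

```python
def fits_signed(value, bits):
    """Return True if `value` fits in a signed two's-complement field of `bits` bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

print(fits_signed(2047, 12), fits_signed(2048, 12))        # True False
print(fits_signed(-2048, 12), fits_signed(-2049, 12))      # True False
print(fits_signed(524287, 20), fits_signed(524288, 20))    # True False
print(fits_signed(ord('$'), 12))                           # True
```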

An immediate value can also be an expression with multiple terms, as long as :

  1. all of the terms are constants
  2. the entire expression is enclosed in parentheses
  3. a @label reference occurs at most once, at the outermost nesting

The result of an immediate expression is computed in the assembler, and only the resulting immediate value is written out. Many of the tenyr operations can be used in immediate expressions, too, as well as an additional one : integer division, with /. Operator precedence within an immediate expression follows the rules for the C language.

B <- B ^ ('A' ^ 'a')    # flip case of the character in B
C <- ((124 - 1) | 1)    # after this, C will contain 123
D <- (8 / 4)            # D will contain 2
E <- (16 - 8 / 4)       # E will contain 14, not 2
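
The values in the comments above can be reproduced in Python, whose precedence for these operators matches C's. One caveat : Python's // rounds toward negative infinity while C's / truncates toward zero, so the two agree only for the non-negative operands used here.

```python
flip_case_mask = ord('A') ^ ord('a')     # the immediate ('A' ^ 'a')
c = (124 - 1) | 1                        # the immediate ((124 - 1) | 1)
d = 8 // 4                               # the immediate (8 / 4)
e = 16 - 8 // 4                          # division binds tighter than subtraction

print(hex(flip_case_mask))               # 0x20
print(chr(ord('q') ^ flip_case_mask))    # Q
print(c, d, e)                           # 123 2 14
```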

Directives

There are a few assembly directives to make assembly easier :

.word 0, 1, 2, 0x1234, 'A'  # each value is expanded to 32 bits
.word (2 + @bar), (8 / 4)   # expressions are accepted by `.word`
.utf32 "Hello, world"       # each character in its own 32-bit word
.utf32 "concat" "" "enate"  # string constants concatenate
.utf32 "concat", "enate"    # string constants store consecutively
.zero 0x14                  # this creates 0x14 = 20 zeros
.global foo                 # mark symbol visible during linking
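
As a rough picture of the words these directives produce, the Python sketch below expands a string one character per word and builds a run of zeros. It assumes that .utf32 emits only the characters themselves, with no terminator or padding, which is not stated above ; the function names are invented for the example.

```python
def utf32_words(text):
    """One 32-bit word per character (assumption: no terminator is appended)."""
    return [ord(ch) for ch in text]

def zero_words(count):
    """The words produced by a .zero directive with the given count."""
    return [0] * count

print([hex(w) for w in utf32_words("Hi")])   # ['0x48', '0x69']
print(len(utf32_words("Hello, world")))      # 12
print(len(zero_words(0x14)))                 # 20
```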

Comments

Three kinds of comments are supported : C89-style comments (non-nesting) delimited by /* and */ ; C99-style comments starting with // ; and shell-style comments starting with #. The latter two types of comments extend only to the end of line, and as there is no line continuation character, every commented line must have its own // or # character. This means that //-comments behave differently when processed by tenyr from when processed by cpp, if line continuation characters are used. #-comments are recommended ; the others are deprecated.

Reversibility

It is intended that disassembling a program and reassembling it will produce an identical binary. The disassembler takes care to explicitly emit otherwise unnecessary (zero) operands to disambiguate instructions. Any situation where an assembler-disassembler-assembler round-trip does not produce identical output on each round is a bug, and should be reported.
