Skip to content
Find file
Fetching contributors…
Cannot retrieve contributors at this time
430 lines (388 sloc) 20.8 KB
( A tiny bootstrap Forth compiler written in itself. -*- coding: utf-8 -*-
Generates an ELF executable for x86 Linux.
Note that this file is Unicode, encoded in UTF-8. This should be an
upper-case Y with a grave accent mark `: Ỳ. If that's okay, you can
ignore the following indented paragraph.
If it appears as a lower-case "a" with an acute accent, followed
by a right guillemot [looking like ">>"], followed by a
superscript "2", you are incorrectly interpreting this file as
ISO-8859-1 or perhaps ISO-8859-15 or Windows-1252. If it appears
as a centered dot, followed by a superscript lowercase "a",
followed by a less-than-or-equal sign, you are incorrectly
interpreting the file as MacRoman. If it appears as a capital "A"
followed by some smoothly joined double lines, you are incorrectly
interpreting the file as KOI8-R. If it appears as a German "sz"
ligature [which kind of looks like the letter "B"] followed by an
upside-down "L" drawn with double lines and then a sort of
checkerboard, you are incorrectly interpreting the file as some
old IBM PC video adapter character set. If it appears as an "AE"
ligature, followed by a right guillemot [">>"], followed by a
dagger [looks like a crucifix], you are incorrectly interpreting
the file as the old PostScript font encoding. In any of these
cases, you have configured your software to be compatible with
obsolete junk equipment from the 1970s instead of the several
billion people who live outside of your country, or perhaps it has
been incorrectly transcoded before you got it. Hopefully this is
helpful in diagnosing and fixing the problem.
On the language:
It’s a dumbed-down Forth with no compile-time execution, in which only
the first byte of each word is significant, with some heterodox names
for functions. Bytes preceded by ' are literals for their numerical
value, e.g. “'A” is 65.
What that means, if you know how to program but don’t know Forth:
Each word of the program is generally a subroutine call. Words are
separated by spaces. There are no types. There’s an implicit stack;
words pop their input off the stack and push their output onto it.
Numbers and literal bytes are pushed onto the stack.
So “2 3 + 4 -” calculates 2 + 3 - 4; “2 3 4 + -” calculates
2 - [3 + 4]. “'Z 'A -” calculates the ASCII value of “Z” minus the
ASCII value of “A”, i.e. it calculates 25. [“+” isn’t in the base
language implemented by the compiler; it’s a user-defined function.
“-” is in the base language.]
“: foo” marks the beginning of a subroutine called “foo”; “;” is the
return instruction. So “: . %flush u ;” defines a subroutine called
“.” which consists first of calling “%flush”, then calling “u”, then
returning.
“var foo” marks the beginning of a variable named “foo”; the word
“foo” thereafter will push its address onto the stack. “# 37”
includes the number 37 in the program at that point as a 32-bit
number, so “var X # 0” creates a variable “X” and puts four bytes of
zeroes there.
“@” and “!” fetch from and store to a memory address respectively. So
“var X # 0 : dup X ! X @ X @ ;” creates a variable X, gives it 32
bits of storage initialized to 0, and defines a subroutine called
“dup”. “dup” takes whatever is on the stack, stores it in those 32
bits of storage at “X”, then fetches it out of X twice, leaving two
copies of it on the stack.
Combining all of the above, “var HERE # 0 : forward HERE @ + HERE ! ;”
fetches the value at the location “HERE”, adds whatever was on the
stack to it, and stores the result back at “HERE”.
Compile-time primitives:
( — defines a comment that extends to the next right-paren
v — defines a dataspace label, like for a variable.
b — compiles a literal byte, numerically, into data space.
# — compiles a literal four-byte little-endian number into data space.
^ — the location where the program should start executing [everything else
is just definitions]
space, newline — ignored
: — defines a function
[ — begins a conditional block, which consumes the value on top of the stack
and is skipped if that value is zero
] — ends a conditional block, matching the latest unmatched [
{ — begins a loop
} — ends a loop — consumes the value on top of the stack and loops back to
the matching { if it is nonzero
Run-time primitives:
G — gets a byte from stdin
W — given an address and size on the stack, writes the specified number of
bytes to stdout
Q — exits the program with a return code of 0
- — subtracts two numbers. [Addition isn't a primitive because it’s easy to
synthesize from subtraction.]
@ — fetches a word from memory
! — stores to a word in memory whose address is on top of the stack
; — returns from a function
< — a less-than test
m — not used directly since the interpreter doesn’t implement it,
this invokes an arbitrary system call with up to three arguments.
The compiler’s implementations of G, W, and Q use this.
s — stores to a byte in memory whose address is on top of the stack
Defined in this file, all necessarily run-time:
h — the ELF header
e — the location of the entry point word in the ELF header
o — the location of the word in the ELF program header that specifies the .ORG
S — the location of the word in the ELF program header that specifies filesize
$ — the start of the area where the program gets compiled
+ — addition
Z — the end of the predefined blob of bytes to output, with slight changes
H — the point where the program is currently getting compiled
f — increases H by its argument
X, Y — two temporary variables
d — DUP
p — DROP
x — SWAP
= — an equality test
> — a greater-than test
D — dispatch a byte being compiled
R — compile the sequence `xchg %ebp, %esp`, buffered by a peephole optimizer
. — add a byte to the program being compiled, flushing the peephole buffer
u — add a byte to the program being compiled, used to implement the peephole
buffer
U — the peephole buffer itself, a boolean storing whether there’s a pending
`xchg %ebp, %esp`
T — table of types of definitions
A — table of addresses of definitions
F — multiply by four
C — register a Colon definition in the tables
V — register a Variable definition in the tables
a — translate an address from compiler address space to user address space
& — compile a byte if it’s a colon or variable definition
O — boolean OR
N — boolean NOT
w — like G, but skips whitespace
B — a buffer for w
E — stick the item on top of the stack at H as four bytes, advancing H by 4
L — compile code to push the constant on the compile-time stack
t — one step in base-ten decoding: x <- 10x + y
n — parse a number from input and push it on the compile-time stack
J — compile a forward conditional jump [
P — resolve a forward conditional jump ]
j — compile a backwards negative conditional jump }
M — the location in the compiler’s memory for the system call routine
Directory of magic numbers used, other than 0 and 1:
2 — as e_type, means ET_EXEC, an executable file;
in the T table, means that a user-defined word is a “variable”
3 — as e_machine, means EM_386, Intel 80386;
the system call number of the `read` system call
4 — the size of a 32-bit word;
second byte of `sub [%esp], %eax`;
second byte of `sub %eax, [%esp]`;
the system call number of the `write` system call
7 — in p_flags, means PF_R | PF_W | PF_X, making the program memory readable,
writable, and executable;
in random bytes, is the pop %es instruction, which is used to help
disassemblers resync with the instruction stream
8 — second byte of `movb %cl, [%eax]`;
2048 / 256
10 — ‘\n’ the ASCII linefeed or newline character
15 — 0F extended opcode prefix byte, used to introduce `setle`, `movsbl`,
and `movzbl` instructions, as well as others not used here.
32 — the size of the ELF program header
36 — third byte of `sub [%esp], %eax`;
third byte of `sub %eax, [%esp]`
40 — the size of any ELF segment headers that might exist
41 — first byte of `sub %eax, [%esp]`
52 — the size of the ELF header
69 — ‘E’ ASCII uppercase E
70 — ‘F’ ASCII uppercase F
76 — ‘L’ ASCII uppercase L
80 — `push %eax`
88 — `pop %eax`
89 — `pop %ecx`
90 — `pop %edx`
91 — `pop %ebx`
116 — `jz` instruction, one-byte immediate PC-relative destination
117 — `jnz` instruction, likewise
127 — the first byte of an ELF executable
128 — the interrupt number Linux uses for system calls: 0x80
129 — first byte of `sub $xxx, %ebp`
133 — first byte of `test %eax, %eax`
135 — first byte of `xchg %ebp, %esp`
136 — first byte of `movb %cl, [%eax]`
137 — first byte of `mov %esp, %ebp`
139 — first byte of `mov [%eax], %eax`, followed by 0
143 — first byte of `pop [%eax]`, followed by 0
157 — second byte of `setge %al`
184 — `mov $xxx, %eax` — load immediate %eax
192 — second byte of `test %eax, %eax`;
third byte of `setle %al`;
third byte of `movsbl %al, %eax`
195 — `ret` instruction
200 — second byte of `dec %al`
205 — `int` interrupt instruction
229 — second byte of `mov %esp, %ebp`
232 — `call` instruction opcode, 4-byte immediate PC-relative destination
236 — second byte of `xchg %ebp, %esp`
237 — second byte of `sub $xxx, %ebp`
254 — first byte of `dec %al`
1024 — a table size in bytes: 256 entries of 4 bytes each
4096 — the program runs with its origin at 4096, 0x1000, because that still
leaves hex addresses easy to type and read, while leaving it likely
that null pointer references will crash the program rather than making
mysterious bugs. Also there’s a buffer of this size, which will go
away soon.
6144 — 6 * 1024: the offset at which a couple of 1024-byte tables are
constructed, ending just before the compiled program itself.
8192 — the offset at which the compiler constructs the compiled program.
655360 — the total memory space allocated for the program
Register-use conventions:
`%eax` — the top of the operand stack
`%esp` — normally, points to the rest of the operand stack in memory, in
the usual `%esp` way: growing downward, predecrement, postincrement.
Temporarily swapped with `%ebp` during `call` and `ret` instructions.
`%ebp` — normally, points to the return stack in memory, in the usual
`%esp` way.
`%ecx` — occasionally used as a scratch register
Other registers are only used to pass arguments to Linux system calls.
)
var header ( ELF header, Elf32_Ehdr )
( ELF info from http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html)
( but I’m using 4096 for my origin rather than 0x08048000. )
( I could use 0 but then null pointer references would usually give incorrect
results instead of usually crashing. )
( e_ident:) byte 127 byte 69 ( E) byte 76 ( L) byte 70 ( F)
byte 1 byte 1 byte 1
( 9 bytes of padding)
byte 0 byte 0 byte 0
byte 0 byte 0 byte 0
byte 0 byte 0 byte 0
( e_type:) byte 2 byte 0 ( e_machine:) byte 3 byte 0
( e_version:) # 1 ( e_entry:) var entry # 0
( The program header offset is 52, the same
as the header size, since the program header
immediately follows this ELF header. )
( e_phoff:) # 52 ( e_shoff:) # 0
( e_flags:) # 0 ( e_ehsize:) byte 52 byte 0
( e_phentsize:) byte 32 byte 0 ( e_phnum:) byte 1 byte 0
( e_shentsize:) byte 40 byte 0 ( e_shentnum:) byte 0 byte 0
( e_shstrndx:) byte 0 byte 0
( Program header, Elf32_Phdr; note that we are now 52 bytes from `header`. )
( p_type:) # 1 ( p_offset:) # 0
( p_paddr should be 0, not equal to p_vaddr as Brian has)
( p_vaddr:) var origin # 4096 ( p_paddr:) # 0
( Note that you can only make `p_memsz` as large as you want if `p_flags` has a
2 = `PF_W` in it. Otherwise, even one extra byte results in a segfault.
655360 bytes should be enough for anyone. )
( p_filesz:) var SIZE # 0 ( p_memsz:) # 655360
( p_flags:) # 7 ( p_align:) # 4096
( b 7 b 7 b 7 b 7 b 7 ( uncomment to help disassemblers resync )
var Msyscall ( System call routine; called in input program
with lowercase `m`)
( Syscall number is already in %eax )
( stack effect: arg2 arg3 arg1 syscallnum → syscallrv )
( xchg %ebp, %esp) byte 135 byte 236
( this really weird order is basically a special case for write(2)
( first we pop the first argument, then the third, then the second)
( pop %ebx) byte 91
( pop %edx) byte 90
( pop %ecx) byte 89
( int 0x80) byte 205 byte 128
( xchg %ebp, %esp) byte 135 byte 236
( ret) byte 195
var /buf byte 0 ( a one-byte read buffer for the program being compiled )
var Z-end ( the end of the non-compiled stuff. )
var X # 0 var Y # 0 ( temp vars )
: dup X ! X @ X @ ; : pop X ! ; : xchg X ! Y ! X @ Y @ ; ( SWAP/exch )
: + 0 xchg - - ; ( a + b is a - [0 - b] )
: $origin header 8192 + ; ( this is where the program gets compiled )
( hopefully 8192 is enough to skip over the compiler itself )
: NOT [ 0 ; ] 1 ; ( Boolean ) : = - NOT ; ( equality )
: OR [ pop 1 ; ] ; : > xchg < ; ( greater-than )
var HERE # 0 ( “HERE”, where we’re compiling )
: forward HERE @ + HERE ! ; ( move HERE “forward” by a given number of bytes )
: u HERE @ store 1 forward ; ( Add a byte to the program being compiled )
( Restack emits an `xchg %ebp, %esp` — but we buffer it to see if it’s
canceled by another one before actually emitting it. U is the buffer, and
%flush flushes it.
So . adds a byte to the program being compiled, after flushing this buffer. )
var U # 0 : %flush U @ [ 135 u 236 u 0 U ! ] ; : Restack U @ NOT U ! ;
: . %flush u ;
( Type and Addr are two 1024-byte tables; since they should start out as zeroes,
we put them past the end of the program instead of allocating space for them.
The Type and Addr routines convert a byte value to the address of an entry
in the tables. )
: Four* dup + dup + ; ( 4 * )
: Type Four* header 6144 + + ; ( Table of definition Types: 1=code, 2=data )
: Addr Type 1024 + ; ( table of definition Addresses )
( register a colon definition )
: Colon %flush dup Type 1 xchg ! HERE @ xchg Addr ! ;
( register a variable definition )
: Var %flush dup Type 2 xchg ! HERE @ xchg Addr ! ;
: Isdigit dup '0 < x '9 > OR NOT ;
: tenstep xchg dup Four* + dup + + ; ( for base-10 decoding: A B -> 10A + B )
( Parses a number from successive digits; first digit is on stack. )
: number '0 - { Getchar dup Isdigit NOT [ pop ; ] '0 - tenstep 1 }
( NOTE: number eats the next byte after the end of the number and discards it. )
var Buf # 0 ( buffer for getchar )
( get next non-whitespace char )
: getchar { Getchar Buf ! Buf @ ' = Buf @ 10 = OR NOT [ Buf @ ; ] 1 }
( word: Gets next token, returns its first char. Only consumes first digit of
numbers. )
: word getchar dup Isdigit [ ; ] dup '' = [ ; ] dup 0 1 - = [ ; ]
{ Getchar dup ' = [ pop ; ] dup 10 = [ pop ; ] 0 1 - = [ ; ] 1 }
( Stores word on top of stack at HERE, praise be for non-aligned access. )
: Encode %flush HERE @ ! 4 forward ;
: Lit 80 . ( push %eax ) 184 . ( load immediate %eax ) Encode ; ( literal)
( Converts a compiler address to an output-program address. )
: addr $origin - Z-end header - + origin @ + ;
( Compiles a user-defined subroutine or variable, using Type and Addr )
: &dispatch dup Type @ 1 = [ ( if it’s a subroutine...)
Restack 232 . ( CALL instruction; now we have to compute the offset )
HERE @ 4 + ( find address of next instruction )
xchg Addr @ xchg - ( find offset to subroutine )
Encode
Restack ; ] ( okay, and otherwise, it’s a variable... )
Addr @ addr Lit ;
( JZ: compile a forward conditional jump )
: JZ 133 . 192 . ( test %eax, %eax ) 88 . ( pop %eax )
116 . HERE @ 0 . ( jz 0 ) ;
( jnz: compile a backward conditional jump )
: jnz 133 . 192 . 88 . 117 . H @ - 1 - . ( jnz whatever ) ;
( Patch: resolve a forward conditional jump by backpatching )
( 1 - is needed because the saved address is one byte inside the jnz
instruction, but the offset should be calculated from the beginning of the
next instruction )
: Patch dup %flush HERE @ xchg - 1 - xchg store ;
: compile-minus 41 . 4 . 36 . 88 . ; ( `sub %eax, [%esp]; pop %eax` )
: implement-< compile-minus 15 . 157 . 192 . ( setge %al )
254 . 200 . ( dec %al ) 15 . 190 . 192 . ( movsbl %al, %eax ) ;
( routine to dispatch a byte being compiled )
: Dispatch
dup '( = [ pop { Getchar ') - } ; ] ( drop comments until right paren )
dup ' = [ pop ; ] ( drop spaces )
dup 10 = [ pop ; ] ( drop newlines )
dup '# = [ pop word number ( writes the next number as data )
Encode ; D ] ( literally )
dup 'b = [ pop word number . ; ] ( b does the same, but just as one byte )
dup '; = [ pop Restack 195 . ; ] ( 195 is `ret` )
dup '! = [ pop 143 . 0 . 88 . ; ] ( `pop [%eax]; pop %eax` )
dup '@ = [ pop 139 . 0 . ; ] ( `mov [%eax], %eax` )
dup ': = [ pop word Colon Restack ; ] ( registers the next word as a colon
definition, then compiles a procedure prolog )
dup '[ = [ pop JZ ; ] ( [ compiles a conditional jump,
pushing an address for backpatching )
dup '] = [ pop Patch ; ] ( backpatches the address on the compile stack )
dup '{ = [ pop %flush HERE @ ; ] ( merely pushes an address to jump back to )
dup '} = [ pop jnz ; ] ( compiles a conditional jump to that address )
dup '' = [ pop Getchar Lit ; ] ( pushes a character literal )
dup 'v = [ pop word Var ; ] ( registers the next word as a variable label )
dup '- = [ pop compile-minus ; ]
dup 's = [ pop 89 . 136 . 8 . 88 . ; ] ( store is `pop %ecx;
movb %cl, [%eax]; pop %eax` )
dup '< = [ pop implement-< ; ]
dup Type @ [ &dispatch ; ] ( dispatch user-defined words )
dup Isdigit [ number Lit ; ] ( parse a number, compile code
to push that number immediate )
dup '* = [ pop word number forward ; ] ( reads a number, adds it to HERE )
dup 'Q = [ pop 0 Lit 0 Lit 0 Lit ( Q calls _exit(0, 0, 0)
1 Lit 'm &dispatch ; ] ( system call #1 (__NR_exit)
( syscall is `m` )
dup 'W = [ pop 1 Lit 4 Lit ( W calls write(1, x, y) ( syscall #4 )
'm &dispatch 88 . ; ] ( pop %eax to discard result )
dup '^ = [ pop HERE @ addr entry ! ( ^ sets the entry point `entry` )
137 . 229 . ( `mov %esp, %ebp` )
129 . 237 . 2048 Encode ; ] ( `sub $2048, %ebp`
to initialize %ebp to %esp-2048 )
. ; ( default is to copy to output, which is stupid )
: Kompile ( routine to “kompile” input to output )
{ word dup 0 1 - = [ pop ; ] ( return if char is -1 ) Dispatch 1 }
^program-starts-here
( The output of the compiler is in two blocks: the block of data from
`header` up to `Z-end`, including the ELF headers, `Msyscall`, and
`/buf`, and the block that is compiled using `HERE`, containing
everything compiled from the input program, starting at `$origin`. We
fake up addresses for `Msyscall` and `/buf`, at the right relative
position to `$origin` [as far before `$origin` as they actually are
before `Z-end`], and poke them into the dictionary as `m` and `\` so
that the output program can call them. Then we actually compile some
code for `Getchar` as if it were in the input program. )
Msyscall $origin Z-end - + HERE ! 'm Colon
/buf $origin Z-end - + HERE ! '\ Var ( register `\` variable )
$origin HERE ! ( initialize HERE )
( routine for `Getchar` — wow, this is awful! compiles to about 80 bytes )
'G Colon Restack ( define `G` )
( \ 1 0 3 m → syscall(__NR_read=3, 0, \, 1)
'\ Dispatch 1 Lit 0 Lit 3 Lit 'm Dispatch
1 Lit '< Dispatch '[ Dispatch ( 1 < [ → if it’s less than 1 )
88 . 0 1 - Lit '; Dispatch ( then pop %eax and return -1 )
'] Dispatch ( ] → otherwise, )
'\ Dispatch ( get the address of \ )
15 . 182 . 0 . ( and fetch a byte from it: `movzbl [%eax], %eax` )
'; Dispatch ( and return! )
Kompile ( compile input into output )
( Now HERE - $origin is the size of the compiled part of the program
and Z-end - header is the size of the fixed-format part. )
HERE @ $origin - Z-end header - + SIZE ! ( store total size in SIZE )
header Z-end header - WRITE ( write headers )
$origin HERE @ $origin - WRITE ( write compiled part of program )
QUIT
Jump to Line
Something went wrong with that request. Please try again.