Permalink
Cannot retrieve contributors at this time
Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign up
Fetching contributors…
| # Simple token-threaded "Forth" interpreter. -*- asm -*- | |
| # Version 3. | |
| # by Kragen Javier Sitaker; dedicated to the public domain, | |
| # i.e. I relinquish whatever exclusive rights copyright law | |
| # gives me with regard to this work. | |
| # Major parts taken from Richard W.M. Jones's public-domain | |
| # JONESFORTH 42 by Richard W.M. Jones <rich@annexia.org> | |
| # http://annexia.org/forth | |
| # As of version 2, this program just outputs the following under | |
| # Linux: | |
| # hell, world, hello | |
| # -120 1 104 | |
| # s: 102 101 100 | |
| # 397 | |
| # lit8 lit ret execute (if) (else) (loop) ! @ c! c@ rp@ rp! r> >r sp@ sp! pop dup over swap 0< & | ^ um+ /% um* r@ syscall5 syscall3 syscall1 bye 0 1 type hello world comma count cr -1 + 1- rot -rot tuck emit u. (u.) ((u.)) ~ 1+ negate . scolon .s pick cells * depth - / cellsize nip (do) i 2dup 2drop dict dictp dictsize nextword pastdict? words < >= 0= cbcmp c@+ bcmp memcmp unloop find r2@ 2swap | |
| # hello134517863 | |
| # , 0 | |
| # 134517710 | |
| # to compile: | |
| # gcc -m32 -nostdlib -static -o tokthr tokthr.S | |
| ### Why Small Things are Interesting | |
| # There are still a lot of computers out there that have tens of | |
| # kilobytes of memory or less, and they cost a lot less than, | |
| # say, a cellphone. Cellphones are apparently still too | |
| # expensive for half the world's population. I want to see how | |
| # close I can get to having a comfortable programming | |
| # environment in a smaller device. | |
| # Some smallish microcontroller chips from five different | |
| # manufacturers, with current Digi-Key prices: | |
| # Name bytes RAM bytes ROM MHz price | |
| # ATtiny2313 128 2048 20 US$1.38 | |
| # ATMega48-20AU 512 4096 20 US$1.62 | |
| # MSP430F1111AIPW 128 2264 8 US$2.43 | |
| # LPC2101 2048 8192 70 US$2.52 | |
| # H8/300H Tiny 1536 8192 12 US$3.58 | |
| # M16C/R8C/Tiny/1B 1024 16384 12 US$3.54 | |
| # SX28AC/SS 136 3072 50 US$2.79 | |
| # There are essentially no 386-compatible devices in this price | |
| # range as far as I can tell; for why I'm not worried about | |
| # that, see the section "How Small This Is". | |
| # More practically and short-termly, small projects can take | |
| # less time to finish, and I feel like I need to learn about | |
| # different approaches to implementing programming languages. | |
| ### Why This is Small | |
| # The normal Forth representation of a function is as an array | |
| # of pointers to the other functions it calls, in sequence; a | |
| # few of those other functions may move the interpreter pointer | |
| # around in that array, or snarf up a constant that's stored in | |
| # the array, or stuff like that, but for the most part, the | |
| # functions just get called in sequence. This is called | |
| # "threaded code" and it's fairly compact, especially on 16-bit | |
| # systems where the pointers are only two bytes. | |
| # A traditional approach taken by Forth implementations to | |
| # reduce code size even further is called "token threading". | |
| # Rather than making arrays of 16-bit or 32-bit pointers, they | |
| # make lists of 8-bit indices into an array of pointers. This | |
| # has two advantages: | |
| # 1. the indices are one fourth the size of a list of 32-bit | |
| # pointers; | |
| # 2. it is possible to save these lists of indices somewhere | |
| # outside of memory and continue to use them even after | |
| # making some changes to the code, as long as the same | |
| # indices in the table have the same meanings. So, for | |
| # example, you could write some boot firmware in this | |
| # "bytecode". | |
| # It also has some disadvantages: | |
| # 1. You run out of space in the table. Even a fairly minimal | |
| # full Forth system contains close to 256 subroutines. You | |
| # can mitigate this by packing, say, two 12-bit pointers | |
| # every three bytes, or maybe by having a special bytecode | |
| # that looks up the next byte in an extended table. | |
| # 2. It's slower and makes the machine-code part of the program | |
| # take more space. The traditional LODSW; JMP AX version of | |
| # $NEXT from the eForth Model, which fetches and executes the | |
| # next execution token in the threaded list, is three bytes | |
| # and two instructions; my 'next' here is 41 bytes and 14 | |
| # instructions, which is big enough that I jump to it (2 | |
| # bytes) rather than making an assembler macro. Which blows | |
| # your branch target buffers to pieces. Oh well. The | |
| # performance penalty is probably two orders of magnitude | |
| # over native code, but I haven't measured it yet. I | |
| # measured an earlier version on my 700MHz PIII laptop on an | |
| # empty loop at about a factor of 3.5 over simple | |
| # direct-threading, which in turn is on the order of 10 times | |
| # slower than machine code. | |
| # Anyway, so this is an example program built using this | |
| # technique. It implements two Forthlike stacks and interpreted | |
| # subroutines, but not yet the ability to define new subroutines | |
| # at run-time. | |
| ### What's Here | |
| # I've implemented all of the primitives from C. H. Ting and | |
| # Bill Muench's public-domain (?) eForth Model 1.0, except for | |
| # the following: | |
| # - I haven't implemented their lowercase "next" (as in FOR | |
| # NEXT) because I think it's a bad idea, it's complex, and it | |
| # can be implemented at a higher level if you really need it; | |
| # instead, I implemented (loop). | |
| # - I didn't implement !IO because it's a no-op in this context; | |
| # - I haven't yet implemented ?RX, although I think it's | |
| # possible to implement it on top of syscall5, using select(); | |
| # Additionally, I implemented multiply and divide primitives. | |
| # However, some of it is untested and therefore probably broken. | |
| # Procedure call and return and the system calls do work. | |
| # Currently registers are used as follows: | |
| # %esi --- interpreter pointer; points to next byte to execute | |
| # %ebp --- return stack pointer; points to last thing pushed. This stack, | |
| # like the other one, grows downwards. | |
| # %esp --- data stack pointer; points to last thing pushed. This is | |
| # the processor's standard stack pointer; "push" and "pop" | |
| # instructions use it, which makes assembly code to use it | |
| # quite concise. The Intel "call" and "ret" instructions | |
| # would also use this stack, but they aren't used in this | |
| # program. | |
| # flags --- the "down" direction flag must be cleared. | |
| # It's probably missing a couple of primitives needed because of | |
| # the token-threading implementation strategy; the address of | |
| # the token table probably needs to be knowable, at least. | |
| # Direct and indirect threading, the normal Forth approaches to | |
| # allowing unrestricted coexistence of words written in assembly | |
| # language and interpreted Forth, both had heavy space costs | |
| # here --- close to 100% for the bytecode currently in the | |
| # system. So the inner interpreter checks, for each bytecode, | |
| # whether it is in the range of bytecodes whose interpretations | |
| # are in native code, and picks the relevant code path. This | |
| # avoids consuming any space per-word for this distinction, but with | |
| # what appears to be a heavy performance cost. | |
| # This is similar to an approach called "bit threading"; as | |
| # explained by Jeff Fox in a comp.lang.forth message | |
| # 2007-05-13, on thread "Cases where Forth seems a little | |
| # clunky": | |
| # I have seen hardware and software implemenations of | |
| # bit-threading where the msb of the address space | |
| # selects between threaded code address lists and | |
| # addresses of CODE subroutines. In both cases 0 is a | |
| # valid address and negative addresses are valid. I think | |
| # this applied to Novix. | |
| # Except that this approach uses a fraction of a bit for most | |
| # tokens instead of a whole bit. | |
| ### How Small This Is | |
| # As a point of comparison, eForth 1.0's machine-code part seems | |
| # to be 171 instructions and 399 bytes, including some data | |
| # that's mixed in there with it. | |
| # As of version 3, the machine-code part of this interpreter is | |
| # 239 bytes in 129 machine-code instructions, plus 19 bytes of | |
| # read-only data, 172 bytes of token table for the 86 | |
| # currently-defined words (30 of which are primitive), 397 | |
| # bytes of their names, and another 370 bytes of bytecode in | |
| # the 56 bytecode words (including a couple of data values | |
| # scattered around), and then another 56 bytes in the main | |
| # program, for a total of 1253 bytes. So the program is | |
| # already less than half written in assembly, in terms of | |
| # object-code size. | |
| # As of version 3, the machine-code part of this program uses | |
| # only 25 different instructions: cld; jmp, jz, jnc, jbe; push, | |
| # pop, lodsb, lodsl, xchg, mov; cmp, movsbl, inc, and, xor, or, | |
| # lea, cdq, rcl, add, idiv, imul; and int. Interestingly, many | |
| # of them are only used once. | |
| # Just before version 3, the non-comment lines were 14% | |
| # assembler macros, 51% assembly language, and 35% bytecode. | |
| # From Brad Rodriguez's January/February 1993 Computer Journal | |
| # article, "Moving Forth: Part 1: Design Decisions in the Forth | |
| # Kernel": | |
| # You can expect the usual 16-bit Forth kernel (see below) | |
| # to occupy about 8K bytes of program space. For a full | |
| # kernel that can compile Forth definitions, you should | |
| # allow a minimum of 1K byte of RAM. | |
| # I'm pretty sure I can beat the 8K requirement by quite a bit | |
| # and still be able to compile Forth definitions --- I'm hoping | |
| # for a factor of 4. Consider, from Jeff Fox's "Thoughtful | |
| # Programming", chapter 3: http://www.ultratechnology.com/forth3.htm | |
| # People assume that since Chuck has refined his Forth down to | |
| # about a 1K object that this means he has just stripped his | |
| # Forth down to a 1K kernel that will boot like in the old days | |
| # and that he is going to compile a complete Forth system on top | |
| # of the 1K before he starts an application. This is wrong. The | |
| # complete Forth system is 1K, and the reason for that is | |
| # maximize Chuck's productivity. What stops people from doing | |
| # what they need to do to solve a problem is all the time spend | |
| # solving all the related sub-problems that pop up as a result | |
| # of complex interconnections between components. To maximize | |
| # his productivity Chuck minimizes the number of these side | |
| # problems that pop up. Keep it simple, and don't get to where | |
| # you are spending 90% or 99% or you time dealing with related | |
| # sub-problems. Avoid unsolvable problems, don't waste your time | |
| # trying to solve them. | |
| # Consider also this quote from Elizabeth Rather: | |
| # As you have seen, so much depends on the specific | |
| # machine architecture. We implemented a TTC [token | |
| # threaded] Forth on some low-end AVR processors with | |
| # very limited code space, and it ran faster than a | |
| # native-code 8051 at comparable clock speed. | |
| # (2007-06-23 comp.lang.forth post on thread "Build your own | |
| # Forth for Microchip PIC (Episode 838): Threading") | |
| # http://objectmix.com/forth/168105-build-your-own-forth-microchip-pic-episode-838-threading.html | |
| # Consider also this quote from the abstract of Frank | |
| # Sergeant's 1991 "A 3-Instruction Forth for Embedded Systems | |
| # Work: Illustrated on the Motorola MC68HC11": | |
| # A 3- instruction Forth makes Forth affordable for | |
| # target systems with very limited memory. It can be | |
| # brought up quickly on strange new hardware. You don't | |
| # have to do without Forth because of memory or time | |
| # limitations. It only takes 66 bytes for the Motorola | |
| # MC68HC11. Full source is provided. . . . The absolute | |
| # minimum the target must do, it seems to me, is fetch a | |
| # byte, store a byte, and call a subroutine. | |
| # http://pygmy.utoh.org/3ins4th.html | |
| # As I said before, there are no small, cheap 386s, and so of | |
| # course the code size of this version is an approximation, and | |
| # it won't be a simple recompile to port it to one of these | |
| # other architectures; but 129 instructions' worth of assembly | |
| # are probably not that big a deal to rewrite for a new | |
| # platform. (I'll probably want to write multiply and divide | |
| # routines, though.) | |
| ### Other Things I Tried | |
| # I tried switching to caching the top of the data stack in a | |
| # register, on the theory that it would shorten things like | |
| # 'and'. Currently 'and' is pop %eax; pop %ebx; and %ebx, | |
| # %eax; push %eax; jmp next. If top of stack is cached in %eax | |
| # instead of being stored in memory, this becomes pop %ebx; and | |
| # %ebx, %eax; jmp next, which is considerably shorter. | |
| # However, most things don't change, and other things become | |
| # longer due to the extra work to save top-of-stack. I tried | |
| # using both %ebx and %eax as the top-of-stack cache. | |
| # In the version using %ebx as top-of-stack, the total size of | |
| # the machine-code part was 216 bytes, 115 instructions, | |
| # compared to 197 bytes, 112 instructions for the version using | |
| # the current strategy. In the version using %eax as | |
| # top-of-stack, it was only 215 bytes, but that's still worse | |
| # than 197. | |
| # In previous versions, all routines were machine-code routines | |
| # that you could just jmp to. High-level bytecode words began | |
| # with "call dolist", which took the saved %eip off the stack | |
| # and stuck it in %esi. Unfortunately, that added 5 bytes to | |
| # each bytecode word; in version 2, the bytecode region is 221 | |
| # bytes and contains 36 word definitions (the 30 machine-code | |
| # primitives aren't defined there) --- so 5 bytes each would | |
| # have been 180 bytes of overhead, or 82%! It also would have | |
| # required a region to be both executable and writable to | |
| # support run-time routine definition, which is kind of a pain | |
| # thse days, and also on Harvard-architecture microcontrollers. | |
| # In previous versions, the token-table entries were 32 bits | |
| # each (instead of 16 bits as they are now), which added | |
| # another 2 bytes of overhead per word. In version 2, there | |
| # are currently 66 words, so that's another 132 bytes of shaved | |
| # overhead. | |
| ### Performance (Speed) | |
| # To version 2, I added a simple program to print out a | |
| # badly-formatted 8-bit extended ASCII table; it was 11 | |
| # bytecode operations long. It executed 81615 bytecodes in all | |
| # (according to gdb). On my 700MHz PIII-Coppermine laptop from | |
| # 1999, 'time' reports CPU times varying from 4-12 | |
| # milliseconds. So it seems like it can execute about 10 000 | |
| # bytecodes in a millisecond, or about 10 million bytecodes in | |
| # a second --- about 100ns per bytecode. | |
| # It also made 1459 system calls. | |
| # That's slightly faster than Python's bytecode engine (maybe | |
| # --- probably within measurement error --- and anyway Python's | |
| # bytecodes are larger-grained), and an order of magnitude | |
| # slower than simple direct-threading, and about three times | |
| # slower than direct-threading with a simple bytecode | |
| # indirection layer. | |
| # I would be surprised if version 3 had any measurable change | |
| # in speed from version 2. | |
| # To version 3, I added a little loop to repeatedly search the | |
| # 86-word wordlist for ", ", which isn't there. I was able to | |
| # do this search 10 000 times in 4.59 CPU seconds, which is | |
| # 5.34 microseconds per comparison, or about 3700 CPU cycles. | |
| # If we eventually have 300 words in the wordlist, each search | |
| # will therefore take 1.6 milliseconds in the worst case, so | |
| # the system won't be able to compile or interpret more than | |
| # about 50-100 lines of code per second. It should probably be | |
| # pretty easy to fix this particular performance problem | |
| # (recode wordlist search in asm, restructure, or whatever) if | |
| # it comes up, but it's a bit worrisome... | |
| ### What's Wrong With This Program | |
| # - It's a long way from doing anything useful. | |
| # - There's 21 instructions of unused code which may be broken. | |
| # In Version 3, these words are in use and so probably work: | |
| # hello, sub1, type, comma, world, newline, dolit_s8, dot, | |
| # bye, exit, branch_on_0, c_bang, drop, dup, swap, negative, | |
| # umplus, divmod, syscall5, syscall3, zero, syscall1, rpop, | |
| # rpush, one, dolit_32, neg1, add, emit, tuck, udot, | |
| # udot_nospc, udot_nonzero, branch, invert, add1, negate, | |
| # xor, printstack, cells, mul, depth, div, sub, cellsize, | |
| # nip, pick, twodup, i, _do, r_at, over, at, sp_at, bang, | |
| # twoswap, r_2at, find, unloop, memcmp, bcmp, c_at_inc, | |
| # cbcmp, zeq, ge, lt, words, pastdict, nextword, dictsize, | |
| # dictp, dict, c_at. These are not tested, and therefore may be | |
| # broken: execute, rp_at, rp_bang, sp_bang, and, or. | |
| # - There's no dictionary structure yet. | |
| # - It probably needs another couple of primitives. | |
| # - There's no checking for stack overflow or underflow, but | |
| # they will break things. | |
| # - It's slow; see above about performance. | |
| ### The Beginning of the Program | |
| # .. include system library header so we know __NR_exit = 1 and | |
| # __NR_write = 4 | |
| #include <asm/unistd.h> | |
| ### The token table and dictionary | |
| # To save space, we're trying to avoid storing pointers | |
| # as much as possible. So most of the code is | |
| # represented as "tokens", which are offsets into the | |
| # "token table", which contains 16-bit offsets into | |
| # either the machine-code primitives or the data | |
| # segment. | |
| # The "dictionary" is stored in this program as a | |
| # sequence of strings, stuffed next to each other with | |
| # no intervening pointers at all. The idea is that if | |
| # you want to execute or compile a word, you walk | |
| # through the dictionary, examining each string in turn | |
| # to see if it's the right word. If so, then the | |
| # number of strings you rejected is the index into the | |
| # token table. | |
| # This implies that the dictionary needs to be stored | |
| # somewhere where it can expand without overwriting | |
| # other stuff. So I'm putting it in its own section | |
| # for the time being. I'm pretty sure that means it'll | |
| # get at least a page, which is enough space to define | |
| # at least several hundred words. Maybe at some future | |
| # time I'll copy it into .bss at boot time instead. | |
| .data 1 # Start putting stuff in data subsection 1 | |
| .align 4 | |
| ## table to define the "bytecode" instructions | |
| token_table: | |
| # "a" means "allocatable", "w" means "writable" | |
| .section .dictionary, "aw" | |
| dictionary: | |
| ## I was frustrated with the unreadability of my | |
| ## bytecode lists; I was counting token table entries | |
| ## by hand and writing bytecodes numerically. So I | |
| ## wrote a macro to help. | |
| ## Note that we are using a separate .subsection | |
| ## directive because gas 2.17 doesn't support putting | |
| ## that in the .pushsection line, even though it is | |
| ## documented to do so; see message from Maciej | |
| ## W. Rozycki on 2007-10-11, subject "Re: How to use | |
| ## .pushsection?", | |
| ## http://sourceware.org/ml/binutils/2007-10/msg00176.html | |
| ## for more details) | |
| ## The first few entries in the table of bytecodes are | |
| ## all defined in machine code; the rest are all | |
| ## defined in bytecode. The inner interpreter examines | |
| ## each bytecode to determine which category it falls | |
| ## in in order to figure out how to execute it, | |
| ## including what base address to add its offset to. | |
| ## This sucks for extensibility but rocks for | |
| ## compactness. | |
| .macro countedstring name | |
| .byte stringlength\@ | |
| 1: .ascii "\name" | |
| ## Here we count the length of the string --- computers | |
| ## are for counting bytes so people don't have to! | |
| stringlength\@ = . - 1b | |
| .endm | |
| .macro define_bytecode name, realname, origin | |
| .pushsection .data # save current position, go to data section | |
| .subsection 1 # and subsection 1, where we put the addrs | |
| b_\name = (. - token_table) / 2 # define b_foo as the index of this ptr | |
| .ifeq b_\name - 256 | |
| .error "\name got bytecode 256" | |
| .endif | |
| .short \name - \origin # insert offset which will be resolved next | |
| .popsection # return to where we were, and | |
| .pushsection .dictionary | |
| countedstring "\realname" | |
| .popsection | |
| \name: # define the name | |
| .endm | |
| .macro defasm name, realname | |
| define_bytecode \name, "\realname", machine_code_primitives | |
| .endm | |
| .macro defbytes name, realname | |
| define_bytecode \name, "\realname", bytecode_start | |
| .endm | |
| ### The Return Stack | |
| # We put Forth return addresses here, but programs can also use | |
| # it for other purposes. | |
| .bss | |
| .space 4096 | |
| initial_return_stack_pointer: | |
| ### Initialization | |
| .text # the following stuff goes in the text segment | |
| .global _start # declare _start as a global symbol | |
| # (otherwise ld won't be able to find it) | |
| _start: # this is the entry point for ELF I guess | |
| cld # clear direction flag; unnecessary? | |
| mov $initial_return_stack_pointer, %ebp | |
| mov $instructions, %esi # %esi is the interpreter pointer register | |
| jmp next # and now we start the interpreter. | |
| # (somewhat silly since we could just | |
| # fall through..) | |
| ### The Machine-Code Primitives | |
| # Also next (aka the address interpreter or inner interpreter) | |
| # is in this section. | |
| machine_code_primitives: | |
| # dolit_s8 takes a signed 8-bit literal from the instruction | |
| # stream and pushes it onto the stack. | |
| defasm dolit_s8, "lit8" | |
| lodsb | |
| movsbl %al, %eax | |
| jmp pusheax | |
| defasm dolit_32, "lit" # more general dolit | |
| lodsl | |
| jmp pusheax | |
| defasm exit, "ret" # Return from a colon defn. | |
| xchg %ebp, %esp | |
| pop %esi | |
| xchg %ebp, %esp | |
| jmp next | |
| defasm execute, "execute" # Run xt on data stack. | |
| pop %eax # Here 'xt' is the one-byte token. | |
| jmp execute_eax | |
| # Branch if top of stack is 0 (implementing IF). | |
| # Both branch instructions take a signed byte offset from the bytecode | |
| # stream. | |
| defasm branch_on_0, "(if)" | |
| pop %eax | |
| and %eax, %eax | |
| jz branch | |
| inc %esi # skip 1-byte jump offset | |
| jmp next | |
| defasm branch, "(else)" | |
| lodsb | |
| movsbl %al, %eax # same insn size as cbtw; cwde | |
| add %eax, %esi | |
| jmp next | |
| # (loop) is what we do at the end of a DO LOOP. | |
| # DO puts a number on top of the return stack that is zero minus | |
| # the number of iterations remaining. So when it finally | |
| # reaches zero, we're done. It also put a number underneath | |
| # that on the return stack from which you can recover the | |
| # iteration counter. | |
| # This scheme is mostly due to F-83, except that in F-83 it | |
| # reached 0x8000 instead of 0, which seemed perverse to me. | |
| # This isn't strictly necessary as part of the minimal primitive | |
| # set, but it seemed like it would make inner loops maybe ten | |
| # times faster, and as with most words that do return-stack | |
| # manipulation, the size penalty is actually negative. (This | |
| # version is 14 bytes; in bytecode it would be 17.) | |
| # ( -- ) ( R: -1 adjustment -- ) in end-of-loop case, and skip | |
| # the interpreter pointer over the jump offset. | |
| # ( -- ) ( R: counter -- counter+1 ) in normal case, and adjust | |
| # interpreter pointer by number of bytes stored after call to | |
| # this routine. | |
| defasm _loop, "(loop)" | |
| xor %eax, %eax # mov $1, %eax is 5 bytes | |
| inc %eax | |
| add %eax, (%ebp) | |
| jnc branch # if no carry, go branch | |
| # If there was a carry, we're done! | |
| add $8, %ebp # drop loop-sys from rstack | |
| lodsb # skip jump offset | |
| jmp next | |
| # Store a cell. | |
| defasm bang, "!" | |
| pop %ebx | |
| pop (%ebx) # I'm amazed this is legal | |
| jmp next | |
| # Fetch a cell. | |
| defasm at, "@" | |
| pop %ebx | |
| push (%ebx) # I'm amazed this is legal too | |
| jmp next | |
| # Store a byte. | |
| defasm c_bang, "c!" | |
| pop %ebx | |
| pop %eax | |
| mov %al, (%ebx) # push and pop don't do bytes | |
| jmp next | |
| # Fetch a byte. | |
| defasm c_at, "c@" | |
| pop %ebx | |
| xor %eax, %eax | |
| mov (%ebx), %al | |
| jmp pusheax | |
| # Get the return stack pointer. | |
| defasm rp_at, "rp@" | |
| push %ebp | |
| jmp next | |
| # Set the return stack pointer. | |
| defasm rp_bang, "rp!" | |
| pop %ebp | |
| jmp next | |
| # Pop the return stack to the data stack | |
| defasm rpop, "r>" | |
| xchg %esp, %ebp | |
| pop %eax | |
| xchg %esp, %ebp | |
| jmp pusheax | |
| # Push the return stack from the data stack | |
| defasm rpush, ">r" | |
| lea -4(%ebp), %ebp | |
| pop (%ebp) | |
| jmp next | |
| # Get the data stack pointer (before it gets pushed). | |
| defasm sp_at, "sp@" | |
| push %esp # safe on 286 and later | |
| jmp next | |
| # Set the data stack pointer. | |
| defasm sp_bang, "sp!" | |
| pop %esp | |
| jmp next | |
| # Pop the stack. | |
| defasm drop, "pop" | |
| pop %eax | |
| jmp next | |
| # Push a copy of TOS. | |
| # eForth 1.0 used BX to index the stack here, for a couple of | |
| # reasons: on the 8086, SP got decremented prior to the fetch, | |
| # and also wasn't valid as a base or index register. | |
| defasm dup, "dup" | |
| pop %eax | |
| push %eax | |
| jmp pusheax | |
| # Stack manipulation ( w1 w2 -- w1 w2 w1 ) | |
| # technically not necessary, but it's so easy and tiny | |
| defasm over, "over" | |
| push 4(%esp) | |
| jmp next | |
| # Swap top two stack items ("exch" in PostScript) | |
| defasm swap, "swap" | |
| pop %edx | |
| pop %eax | |
| # jmp pushedxeax fall through because pushedxeax is next | |
| # pusheax and pushedxeax: a prologue to 'next' that first pushes %edx | |
| # and %eax, or just %eax. | |
| # For a net savings of 13 bytes, last I checked, in all those | |
| # primitives that finish up by pushing something! Clever trick from | |
| # F-83's 1PUSH and 2PUSH. | |
| pushedxeax: | |
| push %edx | |
| pusheax: | |
| push %eax | |
| # now we fall through to 'next' | |
| # "next" fetches the next bytecode and runs it. It's placed | |
| # here in the middle of the bytecode definitions so that more | |
| # of them can use the short two-byte jump form to get to it. | |
| next: | |
| xor %eax, %eax # set %eax to 0 | |
| xor %ebx, %ebx # clear high half of %ebx | |
| lodsb # load %al from where %esi points | |
| # (%esi is the interpreter pointer) | |
| execute_eax: | |
| ## load offset of new word into %ebx | |
| mov token_table(,%eax,2), %bx # bx := token_table[eax * 2bytes] | |
| cmp $last_asm_bytecode, %eax | |
| jbe next_primitive # if primitive, handle primitive word | |
| ## otherwise, handle a bytecode definition or "colon list" | |
| # save old %esi on return stack | |
| xchg %ebp, %esp | |
| push %esi | |
| xchg %ebp, %esp | |
| lea bytecode_start(%ebx), %esi | |
| jmp next | |
| next_primitive: | |
| lea machine_code_primitives(%ebx), %ebx | |
| jmp *%ebx | |
| # Push true if n negative. ( n -- f ) | |
| defasm negative, "0<" | |
| pop %eax | |
| cdq | |
| push %edx | |
| jmp next | |
| # Bitwise operators: | |
| defasm and, "&" | |
| pop %eax | |
| pop %ebx | |
| and %ebx, %eax | |
| jmp pusheax | |
| defasm or, "|" | |
| pop %eax | |
| pop %ebx | |
| or %ebx, %eax | |
| jmp pusheax | |
| defasm xor, "^" | |
| pop %eax | |
| pop %ebx | |
| xor %ebx, %eax | |
| jmp pusheax | |
| # add two unsigned numbers, returning sum and carry. | |
| # ( u1 u2 -- u3 cy ) | |
| defasm umplus, "um+" | |
| xor %eax, %eax | |
| pop %edx | |
| pop %ebx | |
| add %ebx, %edx | |
| rcl $1, %eax | |
| jmp pushedxeax | |
| # Divide double-precision by single-precision, unsigned (?). | |
| # UM/MOD from eForth. ( udl udh un -- ur uq ) | |
| defasm divmod, "/%" | |
| pop %ebx | |
| pop %edx | |
| pop %eax | |
| idiv %ebx | |
| jmp pushedxeax | |
| # Multiply two single-precision numbers, giving a double- | |
| # precision result. ( d1 d2 -- udl udh ) | |
| defasm mmul, "um*" | |
| pop %eax | |
| pop %ebx | |
| imul %ebx | |
| push %eax | |
| push %edx | |
| jmp next | |
| # Copy the top of the return stack onto the data stack. | |
| defasm r_at, "r@" | |
| push (%ebp) | |
| jmp next | |
| # syscall5: | |
| # Linux system call with up to 5 arguments | |
| # This is no longer the fashionable way to make system calls | |
| # in Linux. Now you're supposed to use SYSENTER on newer | |
| # CPUs, and rather than have you figure out which one to use, | |
| # the kernel mmaps a chunk of code called a VDSO into your | |
| # memory space at a random address and tells you where to | |
| # find it using the ELF auxiliary vector. Then you're | |
| # supposed to invoke the dynamic linker or something to parse | |
| # the ELF executable mysteriously manifested in this way by | |
| # the kernel, and then resolve an undefined symbol in libc | |
| # into calls to it. See "What is linux-gate.so.1?" | |
| # http://www.trilithium.com/johan/2005/08/linux-gate/ | |
| # "The Linux kernel: System Calls" by Andries Brouwer, 2003-02-01 | |
| # http://www.win.tue.nl/%7Eaeb/linux/lk/lk-4.html | |
| # "About ELF Auxiliary Vectors" by Manu Garg | |
| # http://manugarg.googlepages.com/aboutelfauxiliaryvectors | |
| # But the old int $0x80 approach still works, thank goodness, | |
| # because all of that is *way* more than these ten | |
| # instructions. | |
| defasm syscall5, "syscall5" | |
| pop %edi | |
| ## we have to save %esi for the interpreter | |
| mov %esi, -4(%ebp) | |
| pop %esi | |
| pop %edx | |
| pop %ecx | |
| pop %ebx | |
| pop %eax | |
| int $0x80 | |
| mov -4(%ebp), %esi | |
| jmp pusheax | |
| last_asm_bytecode = b_syscall5 | |
| ### Basic Interpreted Words | |
| ## a macro for defining interpreted words | |
| ## Because after I left off b_exit once, I wasted a long | |
| ## time trying to figure out what was wrong, so I use this when I can: | |
| .macro def name, realname, bytes:vararg | |
| defbytes \name, "\realname" | |
| .byte \bytes | |
| .byte b_exit | |
| .endm | |
| ## Macros for conditional branch and loop: | |
| ## Because I am tired of tracking down bugs due to | |
| ## getting the jump offsets wrong. | |
| .macro fif, target # if, or end of while loop | |
| .byte b_branch_on_0, \target - . - 1 | |
| .endm | |
| .macro floop, target # do loop | |
| .byte b__loop, \target - . - 1 | |
| .endm | |
| .macro felse, target # else, unconditional jump | |
| .byte b_branch, \target - . - 1 | |
| .endm | |
| .data 2 # separate subsection from token table | |
| bytecode_start: | |
| # System call with three arguments. | |
| def syscall3, "syscall3", b_zero, b_zero, b_syscall5 | |
| # System call with one argument. | |
| def syscall1, "syscall1", b_zero, b_zero, b_syscall3 | |
| def bye, "bye", b_dolit_s8,__NR_exit, b_zero, b_syscall1 # exit program | |
| def zero, "0", b_dolit_s8,0 # push 0 | |
| def one, "1", b_dolit_s8,1 | |
| # This word outputs a string whose address and count are on | |
| # the stack. ( b u -- ) | |
| defbytes type, "type" | |
| .byte b_rpush, b_rpush # move two args onto rstack | |
| # system call is __NR_write: | |
| .byte b_dolit_s8,__NR_write | |
| .byte b_one # push constant 1: stdout | |
| .byte b_rpop, b_rpop # move two args back from rstack | |
| .byte b_syscall3 # call syscall with 3 args | |
| .byte b_drop # discard return value | |
| .byte b_exit # return | |
| # The next few words exist just to poke string addresses | |
| # and lengths onto the stack so "type" can print them. | |
| .macro def_counted_string name, contents | |
| defbytes \name, "\name" | |
| .byte b_dolit_32 # dolit_32 pushes a 32-bit | |
| .int string_\name # literal --- an addr, here | |
| # now push literal length and return | |
| .byte b_count, b_exit | |
| .pushsection .rodata # define the actual string: | |
| string_\name: | |
| countedstring "\contents" | |
| .popsection | |
| .endm | |
| def_counted_string hello, "hello" | |
| def_counted_string world, "world" | |
| def_counted_string comma, ", " | |
| # convert a counted string in memory to an address and | |
| # count on the stack | |
| def count, "count", b_dup, b_add1, b_swap, b_c_at | |
| def cr, "cr", b_dolit_s8, '\n, b_emit | |
| ### Some More Basic Words | |
| def neg1, "-1", b_dolit_s8, -1 # ( -- -1 ) | |
| def add, "+", b_umplus, b_drop # ( a b -- a+b ) drop the carry | |
| def sub1, "1-", b_neg1, b_add # ( n -- n-1 ) | |
| def rot, "rot", b_rpush, b_swap, b_rpop, b_swap # ( a b c -- b c a ) | |
| def unrot, "-rot", b_rot, b_rot # ( a b c -- c a b ) | |
| def tuck, "tuck", b_dup, b_unrot # ( a b -- b a b ) | |
| # emit: output a single byte. eForth calls this "TX!". | |
| # This version is 11 bytes, including the buffer byte, plus the 2-byte | |
| # token table pointer. a machine-code version I wrote the other day | |
| # was 28 bytes. However, I also added rot, unrot, and tuck to support | |
| # this function, and they total 11 bytes, plus 6 bytes of overhead. | |
| # For a total of 11+2+11+6 = 30 bytes. Not winning yet on size over | |
| # x86 asm! But we're getting close. | |
| emit_buffer: | |
| .byte 0 | |
| defbytes emit, "emit" | |
| .byte b_dolit_32 | |
| .int emit_buffer | |
| .byte b_tuck # save a copy of address for b_type | |
| .byte b_c_bang # store into emit buffer | |
| .byte b_one, b_type, b_exit # output one-byte buffer | |
| ### "u." prints out an unsigned number. | |
| # I had a version of this in x86 machine code in 52 bytes (23 | |
| # instructions), essentially exactly the same code as here. | |
| # This is 31 bytes, plus 6 bytes of overhead, plus I had | |
| # to define b_divmod (9 bytes plus 2 bytes overhead). Now we are | |
| # starting to win! | |
| defbytes udot, "u." # print space after number | |
| .byte b_udot_nospc, b_dolit_s8, 0x20, b_emit, b_exit | |
| defbytes udot_nospc, "(u.)" # print number without space | |
| .byte b_dup | |
| fif 1f | |
| .byte b_udot_nonzero, b_exit | |
| 1: .byte b_drop, b_dolit_s8, '0, b_emit, b_exit | |
| defbytes udot_nonzero, "((u.))" | |
| .byte b_zero, b_dolit_s8,10, b_divmod # divide by 10 | |
| .byte b_dup | |
| fif 2f # recurse if nonzero | |
| .byte b_udot_nonzero | |
| felse 3f | |
| 2: .byte b_drop # drop zero quotient | |
| 3: .byte b_dolit_s8, '0, b_add, b_emit # print digit | |
| .byte b_exit | |
| ### Add signed numeric output, ".". This cost 20 bytes plus 8 bytes | |
| # of overhead, but added some fundamental numeric operations; only 12 | |
| # of those 28 bytes are specific to "." | |
| # logical bitwise not | |
| def invert, "~", b_dolit_s8, -1, b_xor | |
| def add1, "1+", b_one, b_add | |
| # arithmetic negation | |
| def negate, "negate", b_invert, b_add1 | |
| # print signed number | |
| defbytes dot, "." | |
| .byte b_dup, b_negative | |
| fif 1f | |
| .byte b_dolit_s8, '-, b_emit, b_negate # in the negative case | |
| 1: .byte b_udot, b_exit | |
| ### Obviously the next thing to do is to add ".S", print the | |
| # stack, so that I can stop having to investigate problems by | |
| # using gdb. | |
| # The bytecode for this consumed 78 bytes in 12 words, plus a | |
| # new 8-byte primitive (mmul) and a new 14-byte primitive | |
| # (_loop), plus six bytes in the initialization routine, for 28 | |
| # bytes of overhead and a total of 78+28+14+8+6 = 134 bytes. | |
| # This is definitely not a size win over machine code! Machine | |
| # code would only be 22 bytes in 7 instructions, if there were a | |
| # way to just CALL the "." routine from machine code, which | |
| # there isn't. | |
| # However, the words added were cells * - / cellsize nip (do) | |
| # (loop) 2dup which are all generally useful, and | |
| # depth pick .s which are more special-purpose. | |
| # The special-purpose words are 35 out of those 78 bytes. | |
| # PRINTSTACK itself is only 15 bytes, and there's hope that the | |
| # 6 bytes of PICK and the 14 bytes of DEPTH will be useful in | |
| # other debugging routines. | |
| # I'm not happy with (do) and (loop), only because (do) | |
| # implements dpANS DO, not dpANS ?DO, so it loops many times | |
| # when it should loop zero times; | |
| def_counted_string scolon, "s: " | |
| defbytes printstack, ".s" | |
| .byte b_scolon, b_type | |
| .byte b_depth, b_zero # loop limits | |
| .byte b_twodup, b_xor | |
| fif 1f # skip loop if stack empty | |
| .byte b__do | |
| 2: .byte b_i, b_pick, b_dot # DO I PICK . LOOP | |
| floop 2b | |
| 1: .byte b_cr, b_exit | |
| def pick, "pick", b_add1, b_cells, b_sp_at, b_add, b_at | |
| def cells, "cells", b_cellsize, b_mul | |
| def mul, "*", b_mmul, b_drop # drop upper 32 bits of multiplication result | |
| bottom_of_stack: | |
| .int 0 | |
| defbytes depth, "depth" | |
| .byte b_sp_at, b_dolit_32 | |
| .int bottom_of_stack | |
| .byte b_at, b_swap, b_sub, b_zero, b_cellsize, b_div, b_exit | |
| def sub, "-", b_negate, b_add # subtract ( a b -- a-b ) | |
| def div, "/", b_divmod, b_nip # int divide ( ul uh n -- quotient ) | |
| def cellsize, "cellsize", b_dolit_s8,4 # ( -- 4 ) | |
| def nip, "nip", b_swap, b_drop # stack manipulation ( a b -- b ) | |
| # 10 0 DO ... LOOP loops 0, 1...9. | |
| # _do sets up return stack for _loop | |
| # similar to F83: ( limit initial -- ) ( R: X -- X initial-limit limit ) | |
| defbytes _do, "(do)" | |
| .byte b_over, b_sub, b_swap, b_rpop, b_swap, b_rpush | |
| .byte b_swap, b_rpush, b_rpush, b_exit | |
| ## return loop counter | |
| def i, "i", b_rpop, b_rpop, b_r_at, b_over, b_rpush, b_add, b_swap, b_rpush | |
| def twodup, "2dup", b_over, b_over | |
| def twodrop, "2drop", b_drop, b_drop | |
| # Now some stuff for dealing with the dictionary. | |
| # This stuff was from 804930a to 8049396, 140 bytes. In that we got: | |
| # - new words: dict dictp dictsize nextword pastdict? words < >= 0= | |
| # cbcmp c@+ bcmp memcmp unloop find r2@ 2swap | |
| # - less concretely: | |
| # - the ability to list of words in the dictionary; | |
| # - the ability to find words in the dictionary; | |
| # - <, >=, and 0= numerical comparisons; | |
| # - cbcmp, bcmp, and memcmp memory manipulations; | |
| # - 2swap and r2@ stack manipulations; | |
| # - unloop loop control; | |
| # - c@+ for iterating over memory. | |
| # That's 17 new words, averaging 8.2 bytecodes each. | |
| dictionary_pointer: | |
| .int end_of_dictionary | |
| defbytes dict, "dict" | |
| .byte b_dolit_32 | |
| .int dictionary | |
| .byte b_exit | |
| defbytes dictp, "dictp" | |
| .byte b_dolit_32 | |
| .int dictionary_pointer | |
| .byte b_exit | |
| def dictsize, "dictsize", b_dictp, b_at, b_dict, b_sub | |
| def nextword, "nextword", b_dup, b_c_at, b_add, b_add1 | |
| def pastdict, "pastdict?", b_dictp, b_at, b_ge | |
| defbytes words, "words" | |
| .byte b_dict | |
| 1: .byte b_dup, b_count, b_type, b_dolit_s8,32, b_emit | |
| .byte b_nextword | |
| .byte b_dup, b_pastdict | |
| fif 1b | |
| .byte b_exit | |
| def lt, "<", b_sub, b_negative | |
| def ge, ">=", b_lt, b_zeq | |
| # logical not: return true for 0, false (0) otherwise | |
| defbytes zeq, "0=" | |
| fif 1f | |
| .byte b_zero, b_exit | |
| 1: .byte b_neg1, b_exit | |
| # To find a word in the dictionary: | |
| # - move the word onto the return stack, and get the dictionary pointer | |
| # - then loop: | |
| # - see if the word is at the current place | |
| # - if so, clean up and return that place | |
| # - otherwise, go to the next word | |
| # - and repeat if we're still in the dictionary | |
| # - then clean up the stacks and return 0 | |
| # Tells whether a counted string equals an address-and-count string. | |
| # 0 for equal, nonzero for unequal. | |
| # ( c-addr1 c-addr2 u -- n ) | |
| def cbcmp, "cbcmp", b_rot, b_count, b_twoswap, b_bcmp | |
| # like F21 @A+: ( c-addr -- c-addr+1 char ) | |
| def c_at_inc, "c@+", b_dup, b_add1, b_swap, b_c_at | |
| # Keep in mind memcmp() in libc is only 30 bytes long. | |
| # This bcmp is a little different from C memcmp or bcmp in | |
| # that it compares two lengths. | |
| defbytes bcmp, "bcmp" # ( c-addr1 u1 c-addr2 u2 -- n ) | |
| .byte b_rot, b_over, b_xor | |
| fif 3f | |
| .byte b_twodrop, b_drop, b_one, b_exit | |
| 3: .byte b_memcmp, b_exit | |
| defbytes memcmp, "memcmp" # ( c-addr1 c-addr2 u -- n ) | |
| .byte b_zero, b__do | |
| 2: .byte b_c_at_inc, b_rot, b_c_at_inc, b_rot | |
| .byte b_sub, b_dup # - dup if | |
| fif 1f | |
| .byte b_unrot, b_twodrop, b_unloop, b_exit | |
| 1: .byte b_drop, b_swap | |
| floop 2b | |
| .byte b_twodrop, b_zero, b_exit | |
| # this should probably go with the other do loop stuff | |
| def unloop, "unloop", b_rpop, b_rpop, b_rpop, b_twodrop, b_rpush | |
| # FIND: ( c-addr u -- token 1 | 0 ) | |
| # 30 bytes; jonesforth 42's asm version is 56 bytes. It's | |
| # fairly directly comparable, although jonesforth's FIND has | |
| # to do bit-masking and includes its own inline NEXTWORD and | |
| # CBCMP, which are actually fairly large here. | |
| defbytes find, "find" | |
| .byte b_rpush, b_rpush, b_zero, b_dict | |
| 1: .byte b_dup, b_r_2at, b_cbcmp, b_zeq # start loop | |
| fif 2f # bcmp 0= if | |
| .byte b_rpop, b_rpop, b_twodrop, b_drop, b_one, b_exit | |
| 2: .byte b_swap, b_add1, b_swap, b_nextword, b_dup, b_pastdict | |
| fif 1b | |
| .byte b_rpop, b_rpop, b_twodrop, b_twodrop, b_zero, b_exit | |
| # copy two cells from return stack | |
| defbytes r_2at, "r2@" | |
| .byte b_rpop, b_rpop, b_r_at, b_over, b_rpush, b_rot, b_rpush, b_exit | |
| # ( a b c d -- c d a b ) | |
| def twoswap, "2swap", b_rpush, b_unrot, b_rpop, b_unrot | |
| .macro create, name | |
| defbytes \name, "\name" | |
| .byte b_dolit_32 | |
| .int 1f | |
| .byte b_exit | |
| 1: | |
| .endm | |
| create "tib" | |
| _tibmax = 80 | |
| .space _tibmax | |
| defbytes tibmax, "tibmax" | |
| .byte b_dolit_32 | |
| .int _tibmax | |
| .byte b_exit | |
| create "tibsize" | |
| .int 0 | |
| defbytes fgets, "fgets" | |
| .byte b_tib, b_tibmax, b_read, b_tibsize, b_bang # XXX handle errors | |
| .byte b_tib, b_tibsize, b_at, b_exit | |
| def gets, "gets", b_zero, b_fgets | |
| defbytes read, "read" | |
| .byte b_rpush, b_rpush, b_dolit_s8,__NR_read | |
| .byte b_zero # fd 0: stdin | |
| .byte b_rpop, b_rpop, b_syscall3, b_exit | |
| def bl, "bl", b_dolit_s8,32 # space, blank | |
| # parse parses out a token of input from a string and leaves | |
| # the token's address and length atop the stack | |
| # ( c-addr u -- c-addr+n u-n c-addr2 u2 ) | |
| defbytes parse, "parse" | |
| .byte b_skipwhitespace | |
| ## XXX finish him! | |
| .byte b_exit | |
| defbytes skipwhitespace, "-wsp" | |
| 1: .byte b_dup, b_zeq | |
| fif 3f # return empty tail | |
| .byte b_exit | |
| 3: .byte b_sub1, b_swap, b_c_at_inc, b_whitespace | |
| fif 2f # escape loop if not whitespace | |
| .byte b_swap | |
| felse 1b | |
| 2: .byte b_sub1, b_swap, b_add1, b_exit | |
| ## costs 9+5 bytes | |
| def_counted_string wsps, " \n\t" | |
| defbytes whitespace, "wsp" | |
| .byte b_dup, b_bl, b_xor | |
| fif 1f | |
| .byte b_dup, b_dolit_s8,'\n, b_xor | |
| fif 1f | |
| .byte b_dup, b_dolit_s8,'\t, b_xor | |
| fif 1f | |
| .byte b_drop, b_zero, b_exit | |
| 1: .byte b_drop, b_one, b_exit | |
| # def repl, "repl", b_gets, b_eval, b_exit | |
| # defbytes eval, "eval" | |
| # 2: .byte b_parse | |
| # .byte b_dup, b_zeq | |
| # fif 3f # escape from loop | |
| # .byte b_rot, b_rpush, b_rot, b_rpush, b_find | |
| # fif 1f | |
| # .byte b_execute | |
| # 1: .byte b_rpop, b_rpop | |
| # felse 2b | |
| # 3: .byte b_2drop, b_2drop, b_exit | |
| .data 3 | |
| instructions: | |
| # And here is the actual "main program" in that bytecode. | |
| .byte b_sp_at, b_dolit_32 | |
| .int bottom_of_stack # variable to remember initial stack bottom | |
| .byte b_bang # initialize that variable | |
| .byte b_hello # string "hello" and count | |
| .byte b_sub1 # subtract 1 from count: "hell" | |
| .byte b_type # spit it out | |
| .byte b_comma, b_type, b_world, b_type # ", world" | |
| .byte b_comma, b_type, b_hello, b_type, b_cr | |
| # test the "dot" command to print out numbers | |
| .byte b_dolit_s8, -120, b_dot | |
| # test positive numbers and "depth" command | |
| .byte b_dolit_s8, 104, b_depth, b_dot, b_dot, b_cr | |
| # test printstack | |
| .byte b_dolit_s8,100, b_dolit_s8,101, b_dolit_s8,102, b_printstack | |
| # test dictsize | |
| .byte b_dictsize, b_dot, b_cr | |
| .byte b_words, b_cr | |
| .byte b_hello, b_twodup, b_type, b_find, b_printstack | |
| .byte b_comma, b_twodup, b_type, b_find, b_printstack | |
| .byte b_dict, b_dot, b_cr | |
| .byte b_dolit_s8, '?, b_emit, b_bl, b_emit, b_gets, b_twodup | |
| .byte b_zero, b__do | |
| 1: .byte b_c_at_inc, b_whitespace, b_dot | |
| floop 1b | |
| .byte b_drop | |
| .byte b_skipwhitespace, b_type | |
| .byte b_bye | |
| # At end of the assembly program, we initialize the | |
| # end_of_dictionary pointer by putting it at the end of the | |
| # assembled .dictionary section: | |
| .section .dictionary | |
| end_of_dictionary: |