jorenvo/jvo-asm

jvo-asm

This is a toy x86 assembler written from scratch. It was written to gain a better understanding of how machine code and executable files work. Its syntax uses a lot of emojis because why not?

Usage

Using the print example:

$ cargo run -- examples/print.jas
hi!

Features

Constants

🖊LINUX_SYSCALL $128
# ...
❗ LINUX_SYSCALL

Comments

# I'm a comment
🦘= ✉exit

Addressing

Immediate addressing

⚫ ⬅ $8

Loads 8 into ⚫.

Register addressing

🔴 ⬅ 🔵

Copies data from 🔵 into 🔴.

Direct addressing

📗my_number 3
# ...
🔴 ⬅ my_number

This loads 3 into 🔴.

Indirect addressing

🔴 ⬅ $0~🔵

This loads the value at the address contained in 🔵 into 🔴.

Base pointer addressing

🔴 ⬅ $4~🔵

Or alternatively with a constant:

🖊ST_ARG $8
# ...
🔴 ⬅ ST_ARG~🔵

This is similar to indirect addressing except that it adds a constant offset to the address in 🔵.
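To make the offset concrete, here is a minimal sketch of how a load such as 🔴 ⬅ $4~🔵 (that is, mov 4(%ecx), %ebx) can be encoded by hand. It is not taken from the jvo-asm source: the names Reg and encode_mov_load_disp8 are hypothetical, only an 8-bit displacement is handled, and %esp is left out because using it as a base requires an extra SIB byte.

```rust
/// Register numbers as they appear in an x86 ModRM byte (32-bit registers).
/// Hypothetical helper type; %esp is omitted because it needs a SIB byte.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Reg {
    Eax = 0,
    Ecx = 1,
    Edx = 2,
    Ebx = 3,
}

/// Encode `mov dst, [base + disp]` using an 8-bit displacement.
///
/// 0x8B is the opcode for MOV r32, r/m32. The ModRM byte is laid out as
/// mod (2 bits) | reg (3 bits) | rm (3 bits); mod = 01 means "a one-byte
/// displacement follows".
pub fn encode_mov_load_disp8(dst: Reg, base: Reg, disp: i8) -> [u8; 3] {
    const MOD_DISP8: u8 = 0b01;
    let modrm = (MOD_DISP8 << 6) | ((dst as u8) << 3) | (base as u8);
    [0x8B, modrm, disp as u8]
}
```

With 🔴 as %ebx and 🔵 as %ecx, encode_mov_load_disp8(Reg::Ebx, Reg::Ecx, 4) yields the bytes 8B 59 04. Indirect addressing is the same encoding with a displacement of 0.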

Labels

🦘 ✉exit
# ...
📪exit:
⚪ ⬅ $1
❗ LINUX_SYSCALL

Labels are defined by prefixing them with 📪 and ending them with a :. To refer to a label, prefix it with ✉ instead.

Data sections

📗numbers 3, 67, 34, 222, 45
# ...
🔵 ⬅ numbers

Data sections start with 📗 and can be referred to later by just their name.

Implementation notes

The main high-level function that processes a file is process. First, the code is broken up into separate lines, and each line is tokenized into a vector of TokenType. ConstantReferences are then replaced by their constants, and the vector is compiled into a vector of IntermediateCode, which consists of bytes and displacements. This intermediate step is needed because a jump to an instruction further down the program cannot be encoded right away: when the jump is encountered, the distance to its target is not yet known. A second pass then iterates through the IntermediateCode and replaces each displacement with bytes, using the byte offset of each instruction, which is recorded during the first pass.
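The second pass can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the Intermediate enum and the resolve function are hypothetical stand-ins for IntermediateCode, every displacement is assumed to be a 32-bit relative jump target, and label offsets are assumed to have been collected already.

```rust
use std::collections::HashMap;

/// Hypothetical intermediate form: either a finished byte or a placeholder
/// for a 32-bit relative displacement to a named label.
#[derive(Debug, Clone, PartialEq)]
pub enum Intermediate {
    Byte(u8),
    Displacement(String),
}

/// Replace every `Displacement` with a little-endian rel32.
/// `labels` maps a label name to its byte offset in the program.
pub fn resolve(
    code: &[Intermediate],
    labels: &HashMap<String, u32>,
) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    for item in code {
        match item {
            Intermediate::Byte(b) => out.push(*b),
            Intermediate::Displacement(name) => {
                let target = *labels
                    .get(name)
                    .ok_or_else(|| format!("unknown label: {}", name))?;
                // x86 relative jumps are measured from the end of the jump
                // instruction, i.e. just past the 4 displacement bytes.
                let next = out.len() as i64 + 4;
                let rel = (target as i64 - next) as i32;
                out.extend_from_slice(&rel.to_le_bytes());
            }
        }
    }
    Ok(out)
}
```

For example, a jump opcode byte E9 followed by a displacement to a label at byte offset 7 and two one-byte instructions resolves to E9 02 00 00 00 plus the two trailing bytes: the jump ends at offset 5, so the target is 2 bytes ahead.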

After this an ELF binary is built. Its layout is as follows (the multiple data sections example was used here):

$ readelf -a a.out
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Intel 80386
  Version:                           0x1
  Entry point address:               0x804b000
  Start of program headers:          52 (bytes into file)
  Start of section headers:          148 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         3
  Size of section headers:           40 (bytes)
  Number of section headers:         5
  Section header string table index: 4

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] pi                PROGBITS        08049000 001000 000014 00  WA  0   0  1
  [ 2] euler             PROGBITS        0804a000 002000 000014 00  WA  0   0  1
  [ 3] .code             PROGBITS        0804b000 003000 000019 00  AX  0   0  1
  [ 4] .shstrtab         STRTAB          00000000 000400 00001a 00      0   0  1

...

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x003000 0x0804b000 0x0804b000 0x00019 0x00019 R E 0x1000
  LOAD           0x001000 0x08049000 0x08049000 0x00014 0x00014 RW  0x1000
  LOAD           0x002000 0x0804a000 0x0804a000 0x00014 0x00014 RW  0x1000

 Section to Segment mapping:
  Segment Sections...
   00     .code
   01     pi
   02     euler

...

There is a program header entry for each data section (📗) and one for the executable code. Everything is padded to 4 KB (the virtual page size). To allow for linking, correct section headers are also generated.
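The padding can be reproduced with a little arithmetic. The sketch below is an assumption-laden illustration rather than the project's code: align_up, layout and virt_addr are hypothetical names, and the base address 0x08048000 is inferred from the dump above, where every virtual address equals the file offset plus that value.

```rust
/// Virtual page size used for padding (4 KB).
const PAGE_SIZE: u32 = 0x1000;
/// Assumed load base, inferred from the readelf output (0x08049000 - 0x1000).
const BASE_ADDR: u32 = 0x0804_8000;

/// Round `n` up to the next multiple of the page size.
pub fn align_up(n: u32) -> u32 {
    (n + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

/// Give each segment a page-aligned file offset, starting after the headers.
/// `sizes` holds the size in bytes of each segment, in file order.
pub fn layout(header_size: u32, sizes: &[u32]) -> Vec<u32> {
    let mut offset = align_up(header_size);
    let mut offsets = Vec::with_capacity(sizes.len());
    for &size in sizes {
        offsets.push(offset);
        offset = align_up(offset + size);
    }
    offsets
}

/// Virtual address of a segment, assuming it is mapped at base + file offset.
pub fn virt_addr(file_offset: u32) -> u32 {
    BASE_ADDR + file_offset
}
```

With the sizes from the dump, layout(348, &[0x14, 0x14, 0x19]) gives the offsets 0x1000, 0x2000 and 0x3000, and virt_addr(0x3000) gives the entry point 0x0804b000. The 348 here is the ELF header, three program headers and five section headers (52 + 3 × 32 + 5 × 40); any header size up to one page produces the same result.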

Instruction reference

Registers

Symbol | Name
 | %eax
🔴 | %ebx
🔵 | %ecx
 | %edx
 | %esp
 | %ebp

Instructions

Symbol | Example | Description
 |  | Return from a function
📞 | 📞 fn | Call function
➕ | ⚪ ➕ ⚫ | ⚪ += ⚫
➖ | ⚪ ➖ ⚫ | ⚪ -= ⚫
✖ | ⚪ ✖ ⚫ | ⚪ *= ⚫
⬅ | 🔴 ⬅ $1 | Move into register
❗ | ❗ $128 | Interrupt
⚖ | ⚖ ⚫, ⚪ | Compare ⚫ to ⚪
🦘= | 🦘= ✉exit | Jump if equal
🦘≠ | 🦘≠ ✉exit | Jump if not equal
🦘< | 🦘< ✉exit | Jump if less than
🦘≤ | 🦘≤ ✉exit | Jump if less or equal
🦘> | 🦘> ✉exit | Jump if greater than
🦘≥ | 🦘≥ ✉exit | Jump if greater or equal
🦘 | 🦘 ✉exit | Unconditional jump
📥 | 📥 $8 | Push onto stack
📤 | 📤 🔵 | Pop from stack
🖊 | 🖊c $4 | Define constant c to be 4
📪 (ends with :) | 📪exit: | Define a label with name exit
📗 | 📗pi 3, 1, 4 | Define a data section pi containing 3 integers
✉ | ✉exit | Refer to a previously defined (📪) exit label
$ | $1 | 1 is a number
# | # hi! | hi! is a comment
[0-9]+ | 1 | 1 is a memory address
[aA-zZ]+ | constant | constant is a previously defined (🖊, 📗) constant
