Strongly typed, C-like systems programming language built for resource-constrained 8-bit microprocessors.
- Getting Started: Key Features & Architecture
- Current Status & Limitations
- Toolchain Usage
- Language Specifications
- Binary Layout
- Third-Party Licenses
- Source Tracking Tokenizer: Maps characters to discrete tokens while maintaining source locations (file, line, column) for robust compilation errors.
- Recursive Descent Parser: Transforms the token stream into a structured AST, treating hardware registers and standard controls as first-class grammatical constructs.
- Lexically Scoped Semantic Analyzer: Two-pass validation engine over the AST. Pass 1 registers all top-level declarations (functions, structs, registers, globals) into the global symbol table. Pass 2 walks function bodies with a scoped symbol table, checking undeclared identifiers, type mismatches, argument counts/types, struct field access, lvalue validity, and return-type consistency. Invalid declarations are poisoned to prevent cascading diagnostics.
- IR Generator: Lowers the analysed AST into a self-contained three-address code (TAC) intermediate representation. The IR module contains struct layouts with computed field offsets, global/register definitions with hardware addresses baked in, and one flat instruction stream per function - codegen can emit target code from the IR alone, without consulting the AST or symbol table. Supports incremental compilation: `-c` serializes the IR to a `.o` file that can be loaded back to skip the frontend entirely.
- Code Generator: Emits valid 65C02 ROM binaries (32K) with a bootstrap runtime, interrupt vectors, and flat zero-page register allocation. Avoids slow stack-based execution by mapping local variables, temporaries, and parameters directly onto zero-page slots. Globals are allocated in RAM ($0200+) and initialized in the bootstrap before `JSR main`. String literals are placed in a ROM data section with backpatching fixups. Supports arithmetic (`+`, `-`, unary `-`) for all integer types (u8/i8/u16/i16), comparisons across all widths and signedness (unsigned via carry-flag, signed via N⊕V), pointer dereference, and function calls. Function calls use a fixed 2-byte-per-param ABI zone ($EF–$FE) for parameter passing; a callee-saves convention (PHA/PLA on all ZP slots) preserves the caller's locals across calls and enables bounded recursion. All emit paths are bounds-checked against the 32 KB ROM limit; programs that overflow produce a clear diagnostic rather than silent corruption. Compiler implicit globals (`__heap_start`, `__memory_top`) are injected automatically and initialized during bootstrap. Programs compile and run on real hardware.
- Disassembler: Decodes compiled `.bin` files back into annotated 65C02 assembly, resolving jump targets to named labels for readability. Supports section-aware output (`.text`/`.data` split), hex dumps with ASCII, and ROM usage summaries. See c02-objdump for more information.
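
The comparison strategy mentioned for the code generator (unsigned via the carry flag, signed via N⊕V) can be checked against a small model of the 6502 subtract. The Python sketch below is illustrative only: `sbc_flags`, `unsigned_lt`, `signed_lt`, and `as_i8` are hypothetical names, and the model covers a single 8-bit `SEC` / `SBC` rather than the instruction sequences the generator actually emits.

```python
def sbc_flags(a: int, b: int):
    """Model an 8-bit `SEC; LDA a; SBC b` and return (result, C, N, V)."""
    a &= 0xFF
    b &= 0xFF
    diff = a - b
    result = diff & 0xFF
    carry = diff >= 0                    # C stays set when no borrow occurred
    negative = bool(result & 0x80)       # N mirrors bit 7 of the result
    # V is set when the operands have different signs and the sign of the
    # result differs from the sign of the minuend.
    overflow = bool((a ^ b) & (a ^ result) & 0x80)
    return result, carry, negative, overflow


def unsigned_lt(a: int, b: int) -> bool:
    """a < b for unsigned bytes: a borrow occurred, so carry is clear."""
    _, carry, _, _ = sbc_flags(a, b)
    return not carry


def signed_lt(a: int, b: int) -> bool:
    """a < b for two's-complement bytes: N xor V after the subtract."""
    _, _, negative, overflow = sbc_flags(a, b)
    return negative != overflow


def as_i8(x: int) -> int:
    """Reinterpret a byte as a signed 8-bit value (reference for checking)."""
    x &= 0xFF
    return x - 0x100 if x & 0x80 else x
```

Looking at N alone would give the wrong answer whenever the subtraction overflows (for example `$80 - $01`), which is why the signed path combines it with V.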
C02 has reached its v1.0 milestone — the complete single-file language, per docs/roadmap.md's checklist. The complete frontend (tokenizer, parser, semantic analyzer), IR generation, and code generator are functional and tested — non-trivial programs compile to valid 65C02 ROMs and run on real hardware without hitting an "unimplemented" wall. Development continues toward v1.1+ (interrupt handlers, inline assembly, multi-file linking, optimization passes — see docs/roadmap.md).
- Data movement: variable copies, constant stores, hardware register writes. Implicit widening (u8→u16) zero-extends correctly; narrowing copies the low bytes.
- Control flow: `if`/`else`, `while`, `for` loops via label/jump/conditional-jump, plus `break`/`continue` inside either loop form.
- Arithmetic: `+`, `-`, unary `-`, `*`, `/`, `%` for all integer types (u8, i8, u16, i16). Width-aware multi-byte emission for 16-bit operations with carry/borrow propagation. Multiply and divide via `__mul8`/`__div8` software subroutines.
- Bitwise & shift ops: `&`, `|`, `^`, `~`, `<<`, `>>` for all widths. Signed right shift uses the carry-from-sign-bit pattern for correct arithmetic extension.
- Comparisons: all six relational operators (`<`, `<=`, `==`, `!=`, `>=`, `>`) for all widths (u8, u16) and signedness (unsigned via carry-flag, signed via N⊕V). 16-bit comparisons use a high-byte-first pattern.
- Increment/decrement: `++`/`--` for both u8 and 16-bit values (pointers, u16), including globals and struct fields.
- Pointer dereference & store: `*p` reads via `LDA ($nn),Y`; `*p = val` writes via `STA ($nn),Y`. Both work for local and global pointer variables.
- Pointer arithmetic: `ptr + int` and `ptr - int` produce a pointer of the same type, enabling `*(msg + i)`-style indexed access.
- Address-of: `&x` resolves to the variable's ZP slot address for locals or its RAM address for globals, stored as a 16-bit pointer.
- Type casts: `(type)expr`; widening zero/sign-extends, narrowing copies low bytes.
- Struct field access: `s.field` and `ptr.field` (auto-deref) for both local and global structs. Field reads and writes work for by-value structs and pointer-to-struct, including `++field`/`--field`.
- Global variables: RAM-allocated globals with bootstrap initialization, correctly accessed via absolute addressing throughout all codegen paths.
- String literals: `u8 *msg = "...";` works both at global scope and as a local variable initializer inside a function body. String data is placed once in the ROM data section with backpatching fixups.
- Compiler implicit globals: `__heap_start` and `__memory_top` are injected automatically as `decl u16` globals and initialized during the bootstrap. `__heap_start` holds the first free RAM byte after all user globals and the compiler implicit globals themselves are allocated (each takes 2 bytes of RAM), useful as a base pointer for bump allocators. `__memory_top` holds the top of the general-purpose RAM region ($3FFF). Both are available in any `.c02` file without a manual `decl`.
- Function calls: full `JSR`/`RTS` ABI with up to 8 parameters passed through the `$EF–$FE` fixed-slot ABI zone. A callee-saves convention (PHA all ZP slots on entry, PLA in reverse on return) preserves caller locals across calls. Bounded recursion is supported; stack depth is limited to ≈256 / (function ZP byte count).
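
The recursion bound quoted above is easy to estimate ahead of time. The sketch below is a rough calculator, not compiler output: `max_recursion_depth` is a hypothetical helper, and the optional 2 bytes per call for the `JSR` return address are an addition the ≈256 / (ZP byte count) approximation leaves out. It also assumes the whole 256-byte hardware stack is available to the recursive function.

```python
STACK_BYTES = 256        # 6502 hardware stack page: $0100-$01FF
RETURN_ADDR_BYTES = 2    # JSR pushes a 2-byte return address


def max_recursion_depth(zp_bytes_saved: int,
                        include_return_addr: bool = True) -> int:
    """Estimate how many nested calls fit on the hardware stack.

    zp_bytes_saved: number of zero-page bytes the function pushes on entry
    under the callee-saves convention.
    """
    if zp_bytes_saved < 0:
        raise ValueError("zp_bytes_saved must be non-negative")
    per_call = zp_bytes_saved
    if include_return_addr:
        per_call += RETURN_ADDR_BYTES
    if per_call == 0:
        raise ValueError("per-call stack cost is zero; the bound is undefined")
    return STACK_BYTES // per_call
```

For a function that saves 6 zero-page bytes, `max_recursion_depth(6, include_return_addr=False)` reproduces the README approximation, while the default call gives the more conservative figure.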
- Arrays: no array type or subscript syntax (`a[i]`). Use pointer arithmetic (`*(ptr + i)`) in the meantime.
- Compound bitwise/shift assignment: `&=`, `|=`, `^=`, `<<=`, `>>=` are not yet supported; the arithmetic compound forms (`+=`, `-=`, `*=`, `/=`, `%=`) work.
- Short-circuit `&&`/`||` outside boolean context: short-circuit evaluation works correctly when used directly as a loop/if condition; using the result as a plain value in other expression contexts is not yet supported.
- Missing-return detection is shallow. A non-void function with no `return` at the end is flagged, but the analyzer does not perform full path-coverage analysis.
- Variable shadowing is disallowed, not silently supported: a declaration that reuses a name still visible from an enclosing scope is a compile error (codegen identifies variables by name only, so a shadowed name would alias its outer namesake's storage). Reusing a name across scopes that don't overlap (e.g. two sibling `for` loops each declaring their own `i`) is unaffected.
If you're exploring the codebase: the parser (parser.c), the analyzer (analyzer.c), the IR generator (ir.c), and the code generator (generator.c) are the main files. Issues and PRs are welcome.
sudo apt install build-essential curl python3 python3-pip -y
# Official Rust install script (for c02-objdump)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# py65 6502 emulator (for runtime tests)
pip install py65
git clone https://github.com/jackwthake/C02.git
cd C02
make

cc02 [OPTIONS] <FILE>

- `<FILE>`: Input file (`.c02` source or `.o`/`.out` IR object)
- `-h, --help`: Show help message
- `-c`: Incremental compile - emit a `.o` IR object file instead of a final binary
- `-o, --output`: Specify output file
- `--token-dump`: Dump the token list after tokenization
- `--ast-dump`: Dump the AST after parsing
- `--symbol-dump`: Dump the global symbol table after analysis
- `--ir-dump`: Dump the IR (TAC instructions) after lowering
- `--syntax-check-only`: Stop after syntax and semantic checks
- `--time-report`: Print a report showing how long each stage of compilation took

Incremental compilation:
cc02 -c hello_world.c02 -o hello_world.o # compile to IR object
cc02 --ir-dump hello_world.o             # inspect the IR from the object file

All generated error messages are presented in a clang-like format with concise source locations. The printed file locations use an editor-friendly format, so you can click one to open the affected file.
The grammar below reflects what the tokenizer and parser currently accept. Semantic analysis validates the full AST after parsing, IR generation lowers it to TAC, and the code generator emits 65C02 machine code — see Getting Started and Current Status for what's working today.
- `u8`/`i8`: 8-bit integers (unsigned / signed)
- `u16`/`i16`: 16-bit integers (unsigned / signed)
- `void`: Function return types with no payload.
- `struct` names: a bare identifier in type position resolves to a struct type (e.g. `Point p;`).
- Pointer types: any base type followed by one or more `*` (e.g. `u8 *msg`, `u16 **pp`).
// single-line comment
/*
block comment
*/

A `.c02` file is a sequence of top-level declarations: functions, `reg` declarations, `struct` declarations, global variables, and forward declarations (`decl`).
fn name(u8 a, u16 *b) -> void {
// body
}

- Parameter list is `(type name, type name, ...)`, can be empty: `()`.
- Return type is required, introduced with `->`.
Hardware interface registers are pinned directly to absolute memory addresses.
reg u8 PORTA @ 0x6001;
reg u8 PORTB @ 0x6000;

struct Point {
    u8 x;
    u8 y;
}

- Body is a sequence of `type name;` fields, no nested initialisers.
- A trailing `;` after the closing `}` is optional.
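
The IR is described as carrying struct layouts with computed field offsets. As a rough model of what such a layout pass might compute, the Python sketch below assigns offsets in declaration order. It assumes packed layout with no padding, 1-byte `u8`/`i8`, and 2-byte `u16`/`i16`/pointers; `layout_struct` and `size_of` are hypothetical names, and the real compiler may lay fields out differently.

```python
TYPE_SIZES = {"u8": 1, "i8": 1, "u16": 2, "i16": 2}
POINTER_SIZE = 2  # pointers are 16-bit addresses


def size_of(type_name: str, structs: dict) -> int:
    """Return the byte size of a scalar, pointer, or previously laid-out struct."""
    if type_name.endswith("*"):
        return POINTER_SIZE
    if type_name in TYPE_SIZES:
        return TYPE_SIZES[type_name]
    if type_name in structs:
        return structs[type_name]["size"]
    raise KeyError(f"unknown type: {type_name}")


def layout_struct(name: str, fields: list, structs: dict) -> dict:
    """Compute field offsets for `fields`, a list of (type, field_name) pairs.

    Assumption: fields are packed back to back in declaration order.
    The result is also recorded in `structs` so later structs can nest it.
    """
    offsets = {}
    offset = 0
    for field_type, field_name in fields:
        if field_name in offsets:
            raise ValueError(f"duplicate field: {field_name}")
        offsets[field_name] = offset
        offset += size_of(field_type, structs)
    structs[name] = {"offsets": offsets, "size": offset}
    return structs[name]
```

For the `Point` struct above this yields `x` at offset 0 and `y` at offset 1, for a total size of 2 bytes.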
u8 *msg = "Hello C02!";
u16 counter;
Point origin;

- Same form as a local variable declaration: `type name;` or `type name = expr;`.
- Struct-typed globals are supported (`Point p;`).
Forward declarations introduce the signature of a function or global defined in another translation unit, allowing cross-file references with incremental compilation (-c).
decl fn send_byte(u8 b) -> void;
decl u8 counter;

- A `decl` for a function uses the same signature syntax as `fn` but has no body.
- A `decl` for a global is `decl type name;` with no initialiser.
- Redeclaring a name that already exists in the same file is an error.
The compiler automatically injects a small set of u16 globals that expose runtime memory layout information. No decl is needed — they are available in every translation unit.
| Name | Value | Description |
|---|---|---|
| `__heap_start` | first free RAM address after all globals | Base pointer for simple bump allocators. |
| `__memory_top` | `$3FFF` | Top of the general-purpose RAM region. |
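
To make the intended use of these two values concrete, here is a small Python model of a bump allocator over the region they delimit. It is a sketch under assumptions: `heap_start` and `BumpAllocator` are hypothetical names, `__memory_top` is treated as the last usable byte, and a real allocator would be written in C02 with pointer arithmetic rather than in Python.

```python
RAM_START = 0x0200            # first byte of user globals
MEMORY_TOP = 0x3FFF           # value of __memory_top
IMPLICIT_GLOBAL_BYTES = 4     # __heap_start and __memory_top, 2 bytes each


def heap_start(user_global_bytes: int) -> int:
    """Address __heap_start would hold for a given amount of user globals."""
    return RAM_START + user_global_bytes + IMPLICIT_GLOBAL_BYTES


class BumpAllocator:
    """Hands out consecutive addresses and never frees them."""

    def __init__(self, start: int, top: int = MEMORY_TOP):
        self.next_free = start
        self.top = top  # assumed to be the last usable byte (inclusive)

    def alloc(self, size: int):
        """Return the address of a `size`-byte block, or None when out of RAM."""
        if size <= 0:
            raise ValueError("size must be positive")
        if self.next_free + size - 1 > self.top:
            return None
        addr = self.next_free
        self.next_free += size
        return addr
```

With 3 bytes of user globals, the heap would begin at `$0207`, and successive `alloc` calls simply advance from there until the request no longer fits below `$3FFF`.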
// variable declaration (local)
u8 x = 5;
Point p; // struct-typed declaration
p = Point{ .x = x, .y = 10 }; // struct with initializer
p = Point{}; // zero initialized struct
Point *p2; // or p2 = null; pointer to a Point struct, uninitialized
Point *p2 = &p; // pointer to a Point struct, initialized
// assignment (also: += -= *= /= %=)
x = x + 1;
x += 1;
// return
return;
return x;
// if / else if / else
if (x > 0) {
// ...
} else if (true) { // `true` and `false` are accepted keywords
// ...
} else {
// ...
}
// while
while (x < 10) {
x += 1;
}
// for (any of the three clauses may be empty)
for (u8 i = 0; i < 10; i += 1) {
// ...
}
// function call statement
do_thing(a, b);

Precedence, lowest to highest:
|| && | ^ & == != < > <= >= << >> + - * / % (unary) (postfix)
- Unary (prefix): `!` (logical not), `-` (negate), `&` (address-of), `~` (bitwise not), `++`/`--`, `*` and `@` (dereference).
- Postfix: `.field` field access, chainable (`a.b.c`). Auto-dereferences struct pointers (`ptr.field` where `ptr` is a `Struct*`).
- Calls: `name(arg1, arg2, ...)`.
- Casts: `(type)expr`, e.g. `(u16)x`.
- Grouping: `(expr)`.
- Literals: decimal/hex integers (`l_num`), string literals (`l_string`), identifiers.
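
One way to see how a precedence list like the one above drives parsing is a precedence-climbing loop over binary operators. The Python sketch below is not taken from `parser.c`: the grouping of operators into levels follows C conventions and is an assumption (the list above only gives the order), it handles decimal literals and parentheses only, and all function names are hypothetical.

```python
import re

# Binary operators from lowest to highest precedence.
# Assumption: operators are grouped into levels the way C groups them.
LEVELS = [["||"], ["&&"], ["|"], ["^"], ["&"], ["==", "!="],
          ["<", ">", "<=", ">="], ["<<", ">>"], ["+", "-"], ["*", "/", "%"]]
PREC = {op: level for level, ops in enumerate(LEVELS) for op in ops}

# Two-character operators are listed before their one-character prefixes.
TOKEN_RE = re.compile(r"\s*(\d+|\|\||&&|==|!=|<=|>=|<<|>>|[-+*/%<>|^&()])")


def tokenize(src: str) -> list:
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        match = TOKEN_RE.match(src, pos)
        if not match:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_primary(tokens: list):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("expected ')'")
        return inner
    if tok.isdigit():
        return int(tok)
    raise SyntaxError(f"unexpected token {tok!r}")


def parse(tokens: list, min_prec: int = 0):
    """Precedence climbing; returns nested (op, lhs, rhs) tuples."""
    lhs = parse_primary(tokens)
    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Parsing the right side one level higher makes operators left-associative.
        rhs = parse(tokens, PREC[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs


def parse_expression(src: str):
    tokens = tokenize(src)
    tree = parse(tokens)
    if tokens:
        raise SyntaxError(f"unexpected token {tokens[0]!r}")
    return tree
```

A hand-written recursive descent parser typically encodes the same table as one function per level; the loop form is used here only because it keeps the sketch short.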
This program cycles LEDs connected to PORTB on a 65C02 breadboard — counting up from 0 to 255 and back down in an infinite loop. It compiles to a valid 32K ROM and runs on real hardware.
reg u8 PORTB @ 0x6000;
reg u8 DDRB @ 0x6002;
fn main() -> void {
DDRB = 0xFF; // Set all pins of PORTB as output
while(true) {
u8 i = 0;
for (; i < 255; ++i) {
PORTB = i;
}
PORTB = i;
for (; i > 0; --i) {
PORTB = i;
}
}
}

cc02 led_counter.c02 -o led_counter.bin   # compile to 32K ROM
c02-objdump led_counter.bin               # disassemble to inspect the output

Every compiled binary is a flat 32 KB ROM image ($8000–$FFFF) loaded at a fixed base address. The layout is always the same regardless of program size; unused space is filled with $EA (NOP). See memmap.md for more info on memory boundaries.
$0000 ┬─────────────────────────────────────────────
│ Zero Page (see ZP table below)
$0100 ├─────────────────────────────────────────────
│ Hardware stack (6502 fixed; $01FF = top)
$0200 ├─────────────────────────────────────────────
│ User globals (RAM_START; allocated upward by allocate_globals)
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
│ __heap_start (u16, 2 bytes — compiler implicit)
│ __memory_top (u16, 2 bytes — compiler implicit)
├───────────────────────────────────────────── ← __heap_start value (first free byte)
│ (free for heap / dynamic use)
$3FFF ┴───────────────────────────────────────────── ← __memory_top value
To maximize code density and execution speed, the code generator reserves the zero page ($0000–$00FF) and maps it as a virtual register file:
| Address Range | Identifier | Purpose |
|---|---|---|
| `$00` | `FP` | Frame Pointer: Initialized to $01FF at startup. |
| `$02–$03` | `RET` | Return Register: Holds function return values (u8 in $02, u16 in $02:$03). |
| `$04–$DF` | `r0–r219` | Scratch Registers: Compiler-managed temporaries, locals, and globals. Allocated per-function from $04 upward, striding by type size (1 byte for u8/i8, 2 for u16/i16/pointers). |
| `$E0–$E7` | - | 16-bit Arithmetic Helper Zone: Fixed slots for `__mul16`, `__div16`, `__sdiv16` helpers. $E0:$E1 = arg1, $E2:$E3 = arg2, $E4:$E5 = result, $E6:$E7 = remainder. |
| `$E8–$EC` | - | 8-bit Arithmetic Helper Zone: Fixed slots for `__mul8`, `__div8`, `__sdiv8` helpers. $E8 = arg1, $E9 = arg2, $EA = result, $EB = remainder, $EC = sign flags (bit 7 = negate quotient, bit 6 = negate remainder). |
| `$ED–$EE` | - | Reserved for future helpers. |
| `$EF–$FE` | `a0–a7` | Function ABI Zone: Fixed 2-byte slots for parameter passing. Caller populates before `JSR`; callee reads at entry. Supports up to 8 sixteen-bit parameters. |
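
The scratch-register row describes a simple bump scheme: start at `$04` and stride by type size. A minimal Python model of that scheme follows. `allocate_zero_page` is a hypothetical name, struct-typed locals are left out, and the real allocator may order or reuse slots differently.

```python
ZP_FIRST = 0x04   # first scratch slot
ZP_LAST = 0xDF    # last scratch slot
TYPE_SIZES = {"u8": 1, "i8": 1, "u16": 2, "i16": 2}


def size_of(type_name: str) -> int:
    """1 byte for u8/i8, 2 bytes for u16/i16 and any pointer type."""
    return 2 if type_name.endswith("*") else TYPE_SIZES[type_name]


def allocate_zero_page(variables: list) -> dict:
    """Assign each (type, name) pair a zero-page address, striding by size."""
    slots = {}
    next_addr = ZP_FIRST
    for type_name, name in variables:
        size = size_of(type_name)
        if next_addr + size - 1 > ZP_LAST:
            raise MemoryError(f"zero page exhausted while placing {name!r}")
        slots[name] = next_addr
        next_addr += size
    return slots
```

Because every function starts again from `$04`, two functions can use the same addresses for different variables, which is why the callee-saves push/pull around calls is needed.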
$8000 ┬───────────────────────────────────────────── ← Reset vector target
│ Bootstrap (SEI · CLD · stack init · global init · JSR main · halt)
├─────────────────────────────────────────────
│ .text — function bodies (main first, then callees, then helpers)
├───────────────────────────────────────────── ← code/data boundary marker ($FFF8–$FFF9)
│ .data — null-terminated string literals
├─────────────────────────────────────────────
│ C02S symbol table (if --strip-debug not set)
      │   magic "C02S" · u16 count · [u16 addr · name\0] …
├─────────────────────────────────────────────
│ NOP fill ($EA bytes)
$FFF6 ├─────────────────────────────────────────────
│ Symbol table pointer (LE u16; $EAEA = absent)
$FFF8 ├─────────────────────────────────────────────
│ Code/data boundary marker (LE u16; first NOP-fill byte)
$FFFA ├─────────────────────────────────────────────
│ NMI vector (LE u16)
$FFFC ├─────────────────────────────────────────────
│ Reset vector (LE u16; always $8000)
$FFFE ├─────────────────────────────────────────────
│ IRQ vector (LE u16)
$FFFF ┴─────────────────────────────────────────────
The $FFF8–$FFF9 boundary word and the $FFF6–$FFF7 symbol-table pointer are read by c02-objdump to locate the .text/.data split and resolve function names. Older binaries that predate these fields have $EAEA at $FFF6 and are disassembled with auto-generated L0/L1/… labels as a fallback.
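
As an illustration of how a tool could consume this footer, the Python sketch below reads the fixed words at the top of the ROM and walks the symbol table. It is not the c02-objdump implementation: the function names are hypothetical, and it assumes the `u16` fields inside the C02S table are little-endian and that names are ASCII, neither of which is stated above.

```python
import struct

ROM_BASE = 0x8000
ROM_SIZE = 0x8000
ABSENT = 0xEAEA   # NOP fill read as a u16 means "no symbol table"


def read_u16(rom: bytes, addr: int) -> int:
    """Read a little-endian u16 at a CPU address inside the ROM image."""
    return struct.unpack_from("<H", rom, addr - ROM_BASE)[0]


def read_footer(rom: bytes) -> dict:
    if len(rom) != ROM_SIZE:
        raise ValueError(f"expected a {ROM_SIZE}-byte image, got {len(rom)}")
    return {
        "symtab": read_u16(rom, 0xFFF6),
        "boundary": read_u16(rom, 0xFFF8),
        "nmi": read_u16(rom, 0xFFFA),
        "reset": read_u16(rom, 0xFFFC),
        "irq": read_u16(rom, 0xFFFE),
    }


def read_symbols(rom: bytes) -> dict:
    """Return {address: name}, or an empty dict when the table is absent."""
    pointer = read_u16(rom, 0xFFF6)
    if pointer == ABSENT:
        return {}
    offset = pointer - ROM_BASE
    if rom[offset:offset + 4] != b"C02S":
        raise ValueError("symbol table magic not found")
    (count,) = struct.unpack_from("<H", rom, offset + 4)  # assumed little-endian
    offset += 6
    symbols = {}
    for _ in range(count):
        (addr,) = struct.unpack_from("<H", rom, offset)
        end = rom.index(b"\0", offset + 2)                # name is NUL-terminated
        symbols[addr] = rom[offset + 2:end].decode("ascii")
        offset = end + 1
    return symbols
```

An image filled entirely with `$EA` reads back `$EAEA` at `$FFF6`, so `read_symbols` returns an empty mapping, matching the fallback behaviour described for older binaries.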
- Crafting Interpreters — the primary reference used throughout development.
- rui314/chibicc — structurally similar (recursive descent, etc.), found after starting this project; not directly followed, but worth a look.
| Dependency | License | Used By |
|---|---|---|
| clap | MIT / Apache-2.0 | c02-objdump CLI argument parsing |
| py65 | BSD | Test harness 65C02 emulator for runtime verification |
