
signed compares, metatiles + collision, register-allocator pass#36

Merged
imjasonh merged 3 commits into main from
claude/prioritize-allocator-signedness-tSyMZ
Apr 19, 2026

Conversation


@imjasonh imjasonh commented Apr 19, 2026

Works through the top three items in the priority ranking at the bottom of docs/future-work.md.

1. Signedness on Cmp16/Cmp ops (§A follow-up)

Ordering compares (<, <=, >, >=) on signed integers now use the canonical 6502 CMP / SBC / BVC / EOR #$80 overflow-correction idiom so the N flag reflects the true sign of the difference. The lowerer tracks signedness on each IR temp (analogous to the existing wide_hi map) and threads it onto a new Signedness field on CmpLt/CmpGt/CmpLtEq/CmpGtEq and their 16-bit variants.
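As a sanity sketch, this is the N-xor-V arithmetic the idiom implements, simulated in Python (helper names are hypothetical, not the compiler's code):

```python
def signed_lt_8bit(a: int, b: int) -> bool:
    """Simulate the 6502 signed-compare idiom for two i8 values given
    as raw 0..255 bytes. SEC/SBC computes a - b; V is set on signed
    overflow; BVC / EOR #$80 flips the result's sign bit when V is
    set, so N ends up reflecting the true sign of the difference."""
    diff = (a - b) & 0xFF                       # SBC result byte (carry set)
    n = diff >> 7                               # N flag: bit 7 of the result
    # V flag: operands had different signs AND the result's sign
    # differs from the minuend's sign
    v = ((a ^ b) & (a ^ diff) & 0x80) >> 7
    return bool(n ^ v)                          # true iff a < b as signed i8

def to_i8(byte: int) -> int:
    return byte - 256 if byte >= 128 else byte

# Exhaustive check against Python's native signed compare:
assert all(
    signed_lt_8bit(a, b) == (to_i8(a) < to_i8(b))
    for a in range(256) for b in range(256)
)
```

Without the correction, `$F6 < $00` (i.e. -10 < 0) would compare false, since a plain BCC/BCS path treats `$F6` as unsigned 246.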

widen() also sign-extends when the source temp is signed, via a new IrOp::SignExtend, so var w: i16 = some_i8_neg round-trips negative values instead of zero-extending to $00F6.
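The high-byte arithmetic SignExtend performs is just this (a sketch, not the actual lowering sequence):

```python
def sign_extend_i8(lo: int) -> tuple[int, int]:
    """What IrOp::SignExtend computes: the high byte is $FF when
    bit 7 of the low byte is set, $00 otherwise. (This is only the
    arithmetic; the exact 6502 instruction sequence is not shown.)"""
    hi = 0xFF if lo & 0x80 else 0x00
    return lo, hi

# -10 as i8 is $F6; widened to i16 it must become $FFF6, not $00F6
lo, hi = sign_extend_i8(0xF6)
assert (hi << 8) | lo == 0xFFF6
```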

Casts to u8/u16 strip the signed flag so expr as u16 < other_u16 stays on the unsigned path.

See examples/signed_compare.ne — four pip sprites gate on signed comparisons; three light (signed-correct) and one stays dark (would only light if the lowering regressed to unsigned).

2. Metatiles + collision (§H)

metatileset Name { metatiles: [{ id, tiles, collide }, ...] } and room Name { metatileset: M, layout: [...] } ship as a cohesive feature. Each metatile bundles 4 CHR tile indices (TL/TR/BL/BR) plus a collide flag; rooms lay them out as a 16×15 grid that the compiler expands at compile time into three PRG blobs:

  • __room_tiles_<name> — 960-byte 32×30 nametable
  • __room_attrs_<name> — 64-byte attribute table
  • __room_col_<name> — 240-byte collision bitmap (one byte per metatile)
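A rough Python model of that compile-time expansion (the data shapes and helper name are hypothetical; attribute-table packing is omitted for brevity):

```python
def expand_room(layout, metatiles):
    """Expand a 16x15 grid of metatile ids into a 32x30 nametable
    and a 240-byte collision map. Each metatile contributes its
    four CHR tiles (TL, TR, BL, BR) as a 2x2 block and one
    collision byte."""
    tiles = bytearray(960)            # 32x30 nametable
    col = bytearray(240)              # one byte per metatile
    for i, mid in enumerate(layout):
        mrow, mcol = divmod(i, 16)
        tl, tr, bl, br = metatiles[mid]["tiles"]
        base = (mrow * 2) * 32 + mcol * 2
        tiles[base], tiles[base + 1] = tl, tr
        tiles[base + 32], tiles[base + 33] = bl, br
        col[i] = 1 if metatiles[mid]["collide"] else 0
    return bytes(tiles), bytes(col)

mts = {0: {"tiles": (0, 0, 0, 0), "collide": False},
       1: {"tiles": (2, 3, 4, 5), "collide": True}}
layout = [1] + [0] * 239              # one solid metatile at the top-left
nt, col = expand_room(layout, mts)
assert len(nt) == 960 and len(col) == 240
assert (nt[0], nt[1], nt[32], nt[33]) == (2, 3, 4, 5)
assert col[0] == 1 and col[1] == 0
```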

paint_room Name reuses the existing load_background vblank-safe update machinery for the nametable blit and additionally installs the room's collision bitmap pointer into ZP_ROOM_COL_LO/ZP_ROOM_COL_HI (ZP $18/$19).

collides_at(x: u8, y: u8) -> bool JSRs into a small runtime helper that reads (room_col),Y with Y = (y & 0xF0) | (x >> 4) and returns the 0/1 byte directly. Gated on a __collides_at_used marker — programs that declare a room but never query it pay zero bytes for the subroutine.
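The index math works because metatiles are 16×16 pixels: the metatile row is `y >> 4` and the column is `x >> 4`, so `row * 16 + col` collapses to a single AND plus a shift. A minimal sketch:

```python
def collides_at(col_map: bytes, x: int, y: int) -> bool:
    """Sketch of the runtime helper's index math: with 16x16-pixel
    metatiles, row = y >> 4 and col = x >> 4, and
    row * 16 + col == (y & 0xF0) | (x >> 4), which indexes the
    240-byte collision map directly."""
    index = (y & 0xF0) | (x >> 4)
    return bool(col_map[index])

col = bytearray(240)
col[2 * 16 + 5] = 1                   # solid metatile at row 2, column 5
assert collides_at(col, x=0x5A, y=0x2C)       # pixel inside that metatile
assert not collides_at(col, x=0x5A, y=0x3C)   # one metatile row below
```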

parse_byte_array grows a [value; count] shortcut so 240-entry layouts stay readable.
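The sugar is equivalent to a simple repeat expansion (hypothetical helper name, shown only to pin down the semantics):

```python
def parse_repeat(value: int, count: int) -> list[int]:
    """[value; count] expands to `count` copies of `value`, so a
    240-entry layout need not be spelled out a byte at a time."""
    return [value] * count

assert parse_repeat(0, 240) == [0] * 240
```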

See examples/metatiles_demo.ne: a probe walks right, bounces off the right wall when collides_at fires, and lands on the left side of the playfield by frame 180.

3. Register-allocator follow-up (Code quality)

remove_dead_loads now steps past opcodes that touch neither A nor the flags an LDA sets — LDX/LDY/INX/INY/DEX/DEY and the flag ops (CLC/SEC/CLI/SEI/CLD/SED/CLV) on top of the INC/DEC/STX/STY opcodes the pass already stepped past.

The highest-leverage case is every single-tile draw. Copy propagation and dead-store elimination together leave:

LDA #<y>          ; stray producer, value never consumed
LDY oam_cursor
LDA #<y>          ; real load before STA
STA $0200,Y

The first LDA was surviving because the pass bailed on the LDY. With the step-past, it drops — one LDA gone per draw, 2 bytes each.
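A toy model of the extended pass, with hypothetical names and only the opcodes relevant here (the real pass operates on the compiler's own instruction representation):

```python
# Opcodes that touch neither A nor the N/Z flags an LDA sets, so the
# scan may step past them when hunting for an overwriting LDA.
STEP_PAST = {"LDX", "LDY", "INX", "INY", "DEX", "DEY",
             "INC", "DEC", "STX", "STY",
             "CLC", "SEC", "CLI", "SEI", "CLD", "SED", "CLV"}

def remove_dead_loads(code):
    """An LDA is dead if a later LDA overwrites A and every
    instruction in between is in STEP_PAST (none of them reads A
    or depends on the N/Z flags the first LDA set)."""
    out, i = [], 0
    while i < len(code):
        op = code[i]
        if op[0] == "LDA":
            j = i + 1
            while j < len(code) and code[j][0] in STEP_PAST:
                j += 1
            if j < len(code) and code[j][0] == "LDA":
                i += 1                # drop the dead load
                continue
        out.append(op)
        i += 1
    return out

code = [("LDA", "#$40"),              # stray producer, never consumed
        ("LDY", "oam_cursor"),
        ("LDA", "#$40"),              # real load before the store
        ("STA", "$0200,Y")]
assert remove_dead_loads(code) == code[1:]
```

Stepping past the flag ops is safe because CLC/SEC/CLI/SEI/CLD/SED/CLV touch C, I, D, or V, none of which an LDA sets or the dropped load's consumers could have read.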

LDA-count reductions on the committed examples:

  example     before  after   Δ
  platformer  242     221     -21 (-8.7 %)
  war         785     754     -31 (-4.0 %)
  pong        843     827     -16 (-1.9 %)

Audio-goldens churn

The cycle savings shift the main-loop ↔ NMI boundary in audio-emitting programs, which re-times which frame each SFX trigger lands in. Six audio hashes re-baselined as a result: audio_demo, friendly_assets, noise_triangle_sfx, platformer, pong, war.

All 50 PNG goldens, the platformer/war/pong demo gifs, and every non-audio program stay byte-identical. The re-baselined audio is still sample-accurate; what changed is the first-SFX offset within the captured 132 084-sample window. This tradeoff is spelled out in docs/future-work.md's register-allocator section along with the remaining wins worth chasing next (cross-block A-tracking, X/Y allocation, spill skipping at codegen time).

Test plan

  • cargo test --all-targets — 779 pass
  • cargo fmt --check, cargo clippy --all-targets -- -D warnings clean
  • tests/emulator/run_examples.mjs — 50/50 ROMs match their (re-baselined) goldens
  • Every examples/*.nes re-committed to match its .ne source
  • docs/{platformer,war,pong}.gif regenerated (byte-identical to the pre-change gifs)

What was added / removed in docs/future-work.md

Removed from §A and §H — both features ship in this PR.
Rewrote the register-allocator section to describe what's shipped (the peephole step-past), the remaining wins, and the audio-shift constraint.
Updated priority ranking to reflect the three top items being done.

claude added 3 commits April 19, 2026 00:17
Closes the §A follow-up gap: ordering compares (`<`, `<=`, `>`, `>=`)
on signed integer types now use the canonical 6502 `CMP / SBC / BVC /
EOR #$80` overflow-correction idiom so the N flag reflects the true
sign of the difference, instead of the previous BCC/BCS-based path
that always treated `$FFxx` as greater than `$00yy`.

The same change also fixes narrow-to-wide widening: assigning a
runtime `i8` expression to an `i16` variable now sign-extends the
high byte via a new `IrOp::SignExtend` op instead of zero-extending
it, so `var w: i16 = some_i8_neg` round-trips negative values.

The lowerer tracks signedness on each IR temp (analogous to the
existing `wide_hi` map) and threads it onto the new `Signedness`
field of `CmpLt`/`CmpGt`/`CmpLtEq`/`CmpGtEq` and their 16-bit
variants. The optimizer's constant-folder uses the same flag to
fold compares correctly under either signedness. Casts to `u8`/`u16`
strip the signed flag so an explicit `as` opt-out stays unsigned.

`examples/signed_compare.ne` exercises both bit widths through the
emulator harness — the four pip sprites at the top of the screen
show three lit (signed-correct) and one dark (would only light if
the compare regressed to unsigned semantics).
…_at`

Closes §H. 2×2 metatiles and a parallel collision map are now a
first-class construct. `metatileset Name { metatiles: [{ id, tiles,
collide }, ...] }` declares a library of 2×2 tile bundles. `room Name
{ metatileset: M, layout: [...] }` lays them out on a 16×15 grid. The
compiler expands each room at compile time into:

- a 960-byte nametable (`__room_tiles_<name>`)
- a 64-byte attribute table (`__room_attrs_<name>`)
- a 240-byte collision bitmap (`__room_col_<name>`)

`paint_room Name` reuses the vblank-safe `load_background` update
machinery for the nametable blit and installs the collision bitmap
pointer into `ZP_ROOM_COL_LO`/`ZP_ROOM_COL_HI` (ZP $18/$19).
`collides_at(x, y)` JSRs into a small runtime helper that reads
`(room_col),Y` with `Y = (y & 0xF0) | (x >> 4)` and returns 0/1.
The helper links in only when the `__collides_at_used` marker is
emitted, so programs that declare a room but never query it pay
zero bytes for the subroutine.

`parse_byte_array` grows a `[value; count]` shortcut — 240-entry
`layout` arrays are unwieldy to spell out a byte at a time.

See `examples/metatiles_demo.ne` for the end-to-end flow: a probe
sprite bounces off walls via `collides_at` and lands on the left
side of the playfield at frame 180 — direct evidence that the
collision query works.

Also defers the register-allocator work from §"Code quality /
tooling" and documents the audio-goldens constraint in future-work
so the next agent sees it.
`remove_dead_loads` now scans past opcodes that touch neither A nor
the flags an LDA sets, so a redundant LDA gets caught by its
successor's overwrite even when an index load or counter bump sits
between them. The extension covers LDX/LDY/INX/INY/DEX/DEY and the
flag ops (CLC/SEC/CLI/SEI/CLD/SED/CLV) alongside the INC/DEC/STX/STY
opcodes the pass already stepped past.

The highest-leverage case is the shape every single-tile `draw`
emits. After copy propagation and dead-store elimination do their
work, the stream reads:

    LDA #<y>      ; stray producer, value never consumed
    LDY oam_cursor
    LDA #<y>      ; real load before STA
    STA $0200,Y

The first LDA was surviving because the pass bailed on the LDY.
With the step-past, it drops. One LDA gone per draw, 2 bytes each.

Measured LDA-count reduction on committed examples:

  platformer  242 → 221   (-21, -8.7 %)
  war         785 → 754   (-31, -4.0 %)
  pong        843 → 827   (-16, -1.9 %)

**Audio goldens.** The cycle savings shift the main-loop/NMI boundary
in audio-emitting programs, which re-times which frame each SFX
trigger lands in. Six audio hashes re-baseline as a result:
audio_demo, friendly_assets, noise_triangle_sfx, platformer, pong,
war. All 50 PNG goldens, the platformer/war/pong demo gifs, and
every non-audio program stay byte-identical. The re-baselined
output is still sample-accurate; what changed is the first-SFX
offset within the captured 132 084-sample window. This is the
audio-shift tradeoff documented in future-work.

Two new peephole unit tests lock in the behaviour:
- `dead_load_elim_steps_past_ldx_ldy` — the DrawSprite shape folds.
- `dead_load_elim_preserves_lda_when_used_by_shift` — a subsequent
  ASL on A keeps the LDA alive across an intervening LDY.

Also updates future-work.md to reflect the shipped change and the
remaining register-allocator wins worth chasing next.
@imjasonh imjasonh changed the title ir/codegen: signed comparison lowering for i8/i16 signed compares, metatiles + collision, register-allocator pass Apr 19, 2026
@imjasonh imjasonh merged commit 6b1cc98 into main Apr 19, 2026
7 checks passed