signed compares, metatiles + collision, register-allocator pass #36
Merged
Closes the §A follow-up gap: ordering compares (`<`, `<=`, `>`, `>=`) on signed integer types now use the canonical 6502 `CMP / SBC / BVC / EOR #$80` overflow-correction idiom so the N flag reflects the true sign of the difference, instead of the previous BCC/BCS-based path that always treated `$FFxx` as greater than `$00yy`.

The same change also fixes narrow-to-wide widening: assigning a runtime `i8` expression to an `i16` variable now sign-extends the high byte via a new `IrOp::SignExtend` op instead of zero-extending it, so `var w: i16 = some_i8_neg` round-trips negative values.

The lowerer tracks signedness on each IR temp (analogous to the existing `wide_hi` map) and threads it onto the new `Signedness` field of `CmpLt`/`CmpGt`/`CmpLtEq`/`CmpGtEq` and their 16-bit variants. The optimizer's constant-folder uses the same flag to fold compares correctly under either signedness. Casts to `u8`/`u16` strip the signed flag so an explicit `as` opt-out stays unsigned.

`examples/signed_compare.ne` exercises both bit widths through the emulator harness: the four pip sprites at the top of the screen show three lit (signed-correct) and one dark (would only light if the compare regressed to unsigned semantics).
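To make the overflow-correction idiom and the sign-extension concrete, here is a minimal Rust model of what the 6502 sequence computes for a 16-bit signed `<`. It is a sketch, not the compiler's code: the function names `signed_lt_16` and `sign_extend_i8` are hypothetical, and the model only mirrors the flag arithmetic of `CMP` on the low bytes, `SBC` on the high bytes, and the `BVC` / `EOR #$80` fix-up.

```rust
/// Hypothetical model of the 6502 signed 16-bit `a < b` test.
/// Mirrors: LDA a_lo / CMP b_lo / LDA a_hi / SBC b_hi / BVC skip / EOR #$80.
fn signed_lt_16(a: u16, b: u16) -> bool {
    let [a_lo, a_hi] = a.to_le_bytes();
    let [b_lo, b_hi] = b.to_le_bytes();

    // CMP sets carry when no borrow is needed (a_lo >= b_lo, unsigned).
    let carry = a_lo >= b_lo;
    let borrow = if carry { 0 } else { 1 };

    // SBC on the high bytes consumes the borrow from the low-byte compare.
    let diff = a_hi.wrapping_sub(b_hi).wrapping_sub(borrow);

    // V is set when the operands differ in sign and the result's sign
    // differs from the minuend's sign (signed overflow).
    let overflow = ((a_hi ^ b_hi) & (a_hi ^ diff) & 0x80) != 0;

    // BVC skips the fix-up; otherwise EOR #$80 flips the sign bit.
    let corrected = if overflow { diff ^ 0x80 } else { diff };

    // N flag of the corrected byte: set means a < b as signed values.
    corrected & 0x80 != 0
}

/// Hypothetical model of widening a signed 8-bit value to 16 bits:
/// the high byte becomes $FF for negatives and $00 otherwise.
fn sign_extend_i8(lo: u8) -> u16 {
    let hi = if lo & 0x80 != 0 { 0xFF } else { 0x00 };
    u16::from_le_bytes([lo, hi])
}
```

With this model, `signed_lt_16(0xFFF6, 0x0005)` is true (-10 < 5), which is exactly the case an unsigned carry-based compare gets wrong, and `sign_extend_i8(0xF6)` yields `0xFFF6` rather than `0x00F6`.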
…_at`
Closes §H. 2×2 metatiles and a parallel collision map are now a
first-class construct. `metatileset Name { metatiles: [{ id, tiles,
collide }, ...] }` declares a library of 2×2 tile bundles. `room Name
{ metatileset: M, layout: [...] }` lays them out on a 16×15 grid. The
compiler expands each room at compile time into:
- a 960-byte nametable (`__room_tiles_<name>`)
- a 64-byte attribute table (`__room_attrs_<name>`)
- a 240-byte collision bitmap (`__room_col_<name>`)
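The expansion above can be sketched in Rust as follows. This is an illustration under stated assumptions, not the compiler's implementation: the `Metatile` struct and `expand_room` function are hypothetical, the tile order is assumed to be top-left, top-right, bottom-left, bottom-right, metatile ids are assumed to index directly into the set, and the 64-byte attribute table is left out because the palette source is not described here.

```rust
/// Hypothetical in-memory form of one metatile: four CHR tile indices
/// (assumed order: TL, TR, BL, BR) plus a collide flag.
struct Metatile {
    tiles: [u8; 4],
    collide: bool,
}

const ROOM_W: usize = 16; // metatiles per row
const ROOM_H: usize = 15; // metatile rows
const NT_W: usize = 32; // nametable tiles per row

/// Expands a 16x15 metatile layout into a 960-byte nametable and a
/// 240-byte collision map (one byte per metatile, 0 or 1).
/// Panics if the layout references an id outside `metatiles`.
fn expand_room(
    metatiles: &[Metatile],
    layout: &[u8; ROOM_W * ROOM_H],
) -> (Vec<u8>, Vec<u8>) {
    let mut nametable = vec![0u8; NT_W * ROOM_H * 2];
    let mut collision = vec![0u8; ROOM_W * ROOM_H];

    for (cell, &id) in layout.iter().enumerate() {
        let mt = &metatiles[usize::from(id)];
        let (row, col) = (cell / ROOM_W, cell % ROOM_W);

        // Each metatile covers a 2x2 block of nametable tiles.
        let top_left = row * 2 * NT_W + col * 2;
        nametable[top_left] = mt.tiles[0];
        nametable[top_left + 1] = mt.tiles[1];
        nametable[top_left + NT_W] = mt.tiles[2];
        nametable[top_left + NT_W + 1] = mt.tiles[3];

        collision[cell] = u8::from(mt.collide);
    }
    (nametable, collision)
}
```

The sizes fall out of the grid: 16×15 metatiles at 2×2 tiles each give 32×30 = 960 nametable bytes, and one collision byte per metatile gives 240.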
`paint_room Name` reuses the vblank-safe `load_background` update
machinery for the nametable blit and installs the collision bitmap
pointer into `ZP_ROOM_COL_LO`/`ZP_ROOM_COL_HI` (ZP $18/$19).
`collides_at(x, y)` JSRs into a small runtime helper that reads
`(room_col),Y` with `Y = (y & 0xF0) | (x >> 4)` and returns 0/1.
The helper links in only when the `__collides_at_used` marker is
emitted, so programs that declare a room but never query it pay
zero bytes for the subroutine.
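The index arithmetic in the helper is worth spelling out: `y & 0xF0` is the metatile row times 16, and `x >> 4` is the metatile column, so their OR is the offset into the 240-byte map. A minimal Rust model, with hypothetical function names and an assumed "out of range reads as no collision" policy that the real helper may not share:

```rust
/// Offset into the collision map for a pixel position.
/// `y & 0xF0` equals (y / 16) * 16; `x >> 4` equals x / 16.
fn collision_index(x: u8, y: u8) -> usize {
    usize::from((y & 0xF0) | (x >> 4))
}

/// Hypothetical model of the runtime query. Positions with y >= 240
/// index past the 240-byte map; this sketch treats them as empty,
/// which is an assumption rather than documented behaviour.
fn collides_at(room_col: &[u8], x: u8, y: u8) -> bool {
    room_col
        .get(collision_index(x, y))
        .copied()
        .unwrap_or(0)
        != 0
}
```

For example, pixel (87, 50) falls in metatile column 5, row 3, so it reads byte 3 * 16 + 5 = 53 of the map.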
`parse_byte_array` grows a `[value; count]` shortcut — 240-entry
`layout` arrays are unwieldy to spell out a byte at a time.
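As a rough illustration of what the repeat shortcut buys, the sketch below parses either a comma-separated list or a single `[value; count]` form. It is not the real `parse_byte_array`: the name `parse_byte_array_sketch` is hypothetical, it works on a plain string rather than the compiler's token stream, and it accepts decimal literals only.

```rust
/// Hypothetical stand-in for a byte-array parser that understands
/// both `[a, b, c]` and the `[value; count]` repeat shortcut.
/// Returns None on malformed input.
fn parse_byte_array_sketch(src: &str) -> Option<Vec<u8>> {
    let inner = src.trim().strip_prefix('[')?.strip_suffix(']')?;

    // Repeat form: one value, a semicolon, then a count.
    if let Some((value, count)) = inner.split_once(';') {
        let value: u8 = value.trim().parse().ok()?;
        let count: usize = count.trim().parse().ok()?;
        return Some(vec![value; count]);
    }

    // List form: comma-separated decimal bytes; empty items
    // (such as after a trailing comma) are skipped.
    inner
        .split(',')
        .map(str::trim)
        .filter(|item| !item.is_empty())
        .map(|item| item.parse::<u8>().ok())
        .collect()
}
```

With this, an empty room layout is `[0; 240]` instead of 240 literal zeros.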
See `examples/metatiles_demo.ne` for the end-to-end flow: a probe
sprite bounces off walls via `collides_at` and lands on the left
side of the playfield at frame 180 — direct evidence that the
collision query works.
Also defers the register-allocator work from §"Code quality /
tooling" and documents the audio-goldens constraint in future-work
so the next agent sees it.
`remove_dead_loads` now scans past opcodes that neither touch A nor
read the N/Z flags an LDA sets, so a redundant LDA gets caught by its
successor's overwrite even when an index load or counter bump sits
between them. The extension covers LDX/LDY/INX/INY/DEX/DEY and the
flag ops (CLC/SEC/CLI/SEI/CLD/SED/CLV) alongside the INC/DEC/STX/STY
opcodes the pass already stepped past.
The highest-leverage case is the shape every single-tile `draw`
emits. After copy propagation and dead-store elimination do their
work, the stream reads:
    LDA #<y>        ; stray producer, value never consumed
    LDY oam_cursor
    LDA #<y>        ; real load before STA
    STA $0200,Y
The first LDA was surviving because the pass bailed on the LDY.
With the step-past, it drops. One LDA gone per draw, 2 bytes each.
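The step-past rule can be modelled in a few lines of Rust. This is a simplified sketch, not the actual pass: the `Op` enum is a hypothetical, much smaller instruction set, and the only rule encoded is "an LDA is dead if the next opcode that is not stepped past is another LDA".

```rust
/// Hypothetical, heavily reduced instruction set for illustration.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Op {
    LdaImm(u8),
    LdxImm(u8),
    LdyZp(u8),
    Inx,
    Iny,
    Dex,
    Dey,
    Clc,
    Sec,
    AslA,
    StaAbsY(u16),
}

/// Opcodes the scan may step past: they neither read nor write A,
/// and they do not read the N/Z flags a preceding LDA produced.
fn is_transparent(op: &Op) -> bool {
    matches!(
        op,
        Op::LdxImm(_) | Op::LdyZp(_) | Op::Inx | Op::Iny
            | Op::Dex | Op::Dey | Op::Clc | Op::Sec
    )
}

/// Drops every LDA whose value is overwritten by a later LDA with
/// only transparent opcodes in between. Anything else (a use of A,
/// a store, or end of stream) conservatively keeps the load.
fn remove_dead_loads(ops: &[Op]) -> Vec<Op> {
    let mut out = Vec::with_capacity(ops.len());
    for (i, op) in ops.iter().enumerate() {
        if matches!(op, Op::LdaImm(_)) {
            let dead = ops[i + 1..]
                .iter()
                .find(|next| !is_transparent(next))
                .map_or(false, |next| matches!(next, Op::LdaImm(_)));
            if dead {
                continue;
            }
        }
        out.push(op.clone());
    }
    out
}
```

Fed the four-instruction draw shape above, the sketch drops the first load and keeps `LDY`, the second `LDA`, and the `STA`. If an `ASL A` follows the `LDY` instead, the scan stops at the shift and the first load survives, matching the two unit tests this change adds.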
Measured LDA-count reduction on committed examples:
    platformer  242 → 221  (-21, -8.7 %)
    war         785 → 754  (-31, -4.0 %)
    pong        843 → 827  (-16, -1.9 %)
**Audio goldens.** The cycle savings shift the main-loop/NMI boundary
in audio-emitting programs, which re-times which frame each SFX
trigger lands in. Six audio hashes re-baseline as a result:
audio_demo, friendly_assets, noise_triangle_sfx, platformer, pong,
war. All 50 PNG goldens, the platformer/war/pong demo gifs, and
every non-audio program stay byte-identical. The re-baselined
output is still sample-accurate; what changed is the first-SFX
offset within the captured 132 084-sample window. This is the
audio-shift tradeoff documented in future-work.
Two new peephole unit tests lock in the behaviour:
- `dead_load_elim_steps_past_ldx_ldy` — the DrawSprite shape folds.
- `dead_load_elim_preserves_lda_when_used_by_shift` — a subsequent
ASL on A keeps the LDA alive across an intervening LDY.
Also updates future-work.md to reflect the shipped change and the
remaining register-allocator wins worth chasing next.
Works through the top three items in the priority ranking at the bottom of `docs/future-work.md`.

1. Signedness on Cmp16/Cmp ops (§A follow-up)

Ordering compares (`<`, `<=`, `>`, `>=`) on signed integers now use the canonical 6502 `CMP / SBC / BVC / EOR #$80` overflow-correction idiom so the N flag reflects the true sign of the difference. The lowerer tracks signedness on each IR temp (analogous to the existing `wide_hi` map) and threads it onto a new `Signedness` field on `CmpLt`/`CmpGt`/`CmpLtEq`/`CmpGtEq` and their 16-bit variants. `widen()` also sign-extends when the source temp is signed, via a new `IrOp::SignExtend`, so `var w: i16 = some_i8_neg` round-trips negative values instead of zero-extending (for example, -10 as `$F6` becoming `$00F6` rather than `$FFF6`).

Casts to `u8`/`u16` strip the signed flag so `expr as u16 < other_u16` stays on the unsigned path.

See `examples/signed_compare.ne`: four pip sprites gate on signed comparisons; three light (signed-correct) and one stays dark (would only light if the lowering regressed to unsigned).

2. Metatiles + collision (§H)
`metatileset Name { metatiles: [{ id, tiles, collide }, ...] }` and `room Name { metatileset: M, layout: [...] }` ship as a cohesive feature. Each metatile bundles 4 CHR tile indices (TL/TR/BL/BR) plus a `collide` flag; rooms lay them out as a 16×15 grid that the compiler expands at compile time into three PRG blobs:

- `__room_tiles_<name>`: 960-byte 32×30 nametable
- `__room_attrs_<name>`: 64-byte attribute table
- `__room_col_<name>`: 240-byte collision bitmap (one byte per metatile)

`paint_room Name` reuses the existing `load_background` vblank-safe update machinery for the nametable blit and additionally installs the room's collision bitmap pointer into `ZP_ROOM_COL_LO`/`ZP_ROOM_COL_HI` (ZP `$18`/`$19`).

`collides_at(x: u8, y: u8) -> bool` JSRs into a small runtime helper that reads `(room_col),Y` with `Y = (y & 0xF0) | (x >> 4)` and returns the 0/1 byte directly. Gated on a `__collides_at_used` marker, so programs that declare a room but never query it pay zero bytes for the subroutine.

`parse_byte_array` grows a `[value; count]` shortcut so 240-entry layouts stay readable.

See `examples/metatiles_demo.ne`: a probe walks right, bounces off the right wall when `collides_at` fires, and lands on the left side of the playfield by frame 180.

3. Register-allocator follow-up (Code quality)
`remove_dead_loads` now steps past opcodes that neither touch A nor read the N/Z flags an LDA sets: LDX/LDY/INX/INY/DEX/DEY and the flag ops (CLC/SEC/CLI/SEI/CLD/SED/CLV) on top of the INC/DEC/STX/STY opcodes the pass already stepped past.

The highest-leverage case is every single-tile `draw`. Copy propagation and dead-store elimination together leave a stray `LDA #<y>` ahead of the `LDY oam_cursor` that precedes the real `LDA #<y>` / `STA $0200,Y` pair. The first LDA was surviving because the pass bailed on the LDY. With the step-past, it drops: one LDA gone per draw, 2 bytes each.

LDA-count reductions on the committed examples:

- platformer: 242 → 221 (-21, -8.7 %)
- war: 785 → 754 (-31, -4.0 %)
- pong: 843 → 827 (-16, -1.9 %)
Audio-goldens churn
The cycle savings shift the main-loop ↔ NMI boundary in audio-emitting programs, which re-times which frame each SFX trigger lands in. Six audio hashes re-baselined as a result: audio_demo, friendly_assets, noise_triangle_sfx, platformer, pong, war.

All 50 PNG goldens, the platformer/war/pong demo gifs, and every non-audio program stay byte-identical. The re-baselined audio is still sample-accurate; what changed is the first-SFX offset within the captured 132 084-sample window. This tradeoff is spelled out in `docs/future-work.md`'s register-allocator section along with the remaining wins worth chasing next (cross-block A-tracking, X/Y allocation, spill skipping at codegen time).

Test plan
- `cargo test --all-targets`: 779 pass
- `cargo fmt --check`, `cargo clippy --all-targets -- -D warnings` clean
- `tests/emulator/run_examples.mjs`: 50/50 ROMs match their (re-baselined) goldens
- each `examples/*.nes` re-committed to match its `.ne` source
- `docs/{platformer,war,pong}.gif` regenerated (byte-identical to the pre-change gifs)

What was added / removed in `docs/future-work.md`

- Removed from §A and §H, since both features ship in this PR.
- Rewrote the register-allocator section to describe what's shipped (the peephole step-past), the remaining wins, and the audio-shift constraint.
- Updated priority ranking to reflect the three top items being done.