Inside the Apple Macintosh Display Card 24AC
The Apple Macintosh Display Card 24AC is a 24-bit colour NuBus display card, manufactured by Radius and white-labelled by Apple, whose headline feature is a QuickDraw accelerator. Note that Apple's Tech Info Library says "QuickDraw Acceleration"; it doesn't explicitly say hardware acceleration, which is a slight hint of things to come.
There is, as far as I've been able to find, no real documentation available on the internet regarding the hardware of the card, so we can mainly learn about this hardware by studying the software written for and running with the card. That software mainly boils down to:

- NuBus declaration ROM. Every NuBus card carries one. On this card it holds the ordinary video driver, the code the Slot Manager and the boot ROM use to bring the card up as a screen: monitor sensing, mode/depth selection, the CLUT/RAMDAC, VBL interrupts, and the linear framebuffer. Nothing out of the ordinary, and really no clues regarding the "acceleration" itself.
- An INIT/cdev (the control panel). This is loaded when the System is started, and it is the entire subject of this article: the acceleration lives here. In this case I've used the Apple-branded INIT/cdev, but I strongly believe that the Radius-branded software is very, very similar.

Inside the INIT/cdev, there are basically three types of resources that seem to hold code: the expected INIT, as well as two code resource types I haven't really encountered before:

| Resource | Role |
|---|---|
| INIT 0/1/2 | the installer: startup gating, card discovery, trap-table patching, boot splash for the init |
| QCOD 0 | the hardware-assisted engine: self-init, a per-slot aperture table, the trap-patch table, and ten hardware "hook" bodies |
| QDPA 0/1/2 | the pure-software bodies: a stretch DDA, a region-masked 1-bit fill, and the basic copy loop |
So from our perspective in this text, the two code resources that matter are QCOD and QDPA. Between them they carry
thirteen replacement routines for QuickDraw's innermost drawing loops. That is the whole accelerator. Before digging
into the details, let's briefly look at the QuickDraw architecture.
QuickDraw's public calls like CopyBits, FillRect, and so on (as well as internal calls like StretchBits) are not
monolithic. Internally they funnel down to a small set of "bottleneck" primitives: the tightest inner loops that
actually move and combine pixels. And Apple deliberately dispatched those inner loops through the trap table, the
same numbered-vector mechanism used for Toolbox calls, even though these internal primitives are "private" to
QuickDraw. The interesting result is of course that they can be replaced/patched like any "public" trap.
The full set of these replaceable hooks is below — the routine, its trap number (its slot in the dispatch table), and the trap word (the instruction that invokes it):
| Hook (routine) | Trap # | Trap word | What it does |
|---|---|---|---|
| FastSlabMode | $30C | $AB0C | solid-slab fill (line/polygon path) |
| SetUpStretch | $324 | $AB24 | stretch / scale setup |
| bMAIN0 | $330 | $AB30 | basic copy, main loop |
| bSETUP8 | $334 | $AB34 | patterned fill, 8 bpp+ |
| bXMAIN8 | $338 | $AB38 | XOR / transfer-mode fill |
| bEND0 | $340 | $AB40 | basic copy, tail |
| bSetup0 | $358 | $AB58 | 68040 fast copy, setup |
| bLeft0 | $359 | $AB59 | 68040 fast copy, left edge |
| rMASK0 | $35A | $AB5A | region-masked 1-bit fill |
| rMASK8 | $35E | $AB5E | region-masked 8 bpp fill |
| rXMASK8 | $362 | $AB62 | region-masked XOR fill |
| slXMASK8 | $384 | $AB84 | scaled masked fill |
| stScanLoop | $399 | $AB99 | stretch scanline DDA |
As far as I can see, BitBlt, RgnBlt and StretchBits invoke these by trap (_bMAIN0, _rMASK8, …) rather than by
a direct JSR. Because a trap is just a table entry, any INIT can call _SetToolTrapAddress and substitute its own
routine — and QuickDraw will happily call the replacement.
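
The two numeric columns of the hook table are related mechanically, which is easy to check. The following is a minimal sketch, assuming the usual Toolbox A-trap encoding in which the trap word is $A800 plus the trap number; the function name and the subset of hooks are illustrative, with the values copied from the table.

```python
# Sketch only: assumes Toolbox trap words are $A800 | trap number.
# trap_word() and HOOKS are illustrative names, not taken from the driver.
TOOLBOX_BASE = 0xA800

def trap_word(trap_number: int) -> int:
    """Return the 16-bit trap word for a Toolbox trap number (0..$3FF)."""
    if not 0 <= trap_number <= 0x3FF:
        raise ValueError("trap number out of range for this sketch")
    return TOOLBOX_BASE | trap_number

# A few rows from the hook table, as (name -> trap number).
HOOKS = {"FastSlabMode": 0x30C, "bMAIN0": 0x330, "rMASK0": 0x35A, "stScanLoop": 0x399}

for name, number in HOOKS.items():
    print(f"{name:12s} ${number:03X} -> ${trap_word(number):04X}")
```

Patching a hook then amounts to replacing the table entry selected by that number, which is what _SetToolTrapAddress does for the INIT.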
The beauty here is of course that if we "accelerate" these thirteen routines, we are basically accelerating a number of different higher-level routines like BitBlt, RgnBlt, StretchBits and so on.
The names encode a taxonomy worth internalising, because it tells you at a glance what each hook does and which will be worth accelerating:
- …0 vs …8: pixel depth. …0 is the 1-bit / monochrome family; …8 is the chunky 8-bit-and-deeper (colour) family.
- b…: the plain rectangular blit (BitBlt).
- r…: the region-masked (arbitrary-shape-clipped) variants (RgnBlt), driven by a run-encoded instruction stream.
- X: the XOR / "big pattern" (indexed-offset pattern) variant.
- sl…: a scaled (stretch) masked variant.
Two consequences fall straight out of this taxonomy, and they define the whole strategy of the card:
- The colour (…8) fill/pattern/XOR/region hooks are where a colour card spends its time, so those are the ones worth putting in silicon.
- A plain CopyBits of already-formed pixels (bMAIN0/bEND0) is not a fill (there's no pattern to replicate, no raster-op to apply), so, as we'll see, the card doesn't hardware-accelerate that generic copy hook at all.
The engine the control panel drives is simple enough to describe in full: a slot of NuBus address space, three registers, and one clever piece of address aliasing. Here is the whole thing.
The card is an ordinary NuBus card in some slot. Write its slot number as s — a single hex digit in the range
$9–$E on these machines. That slot's address space begins at 0xFs000000, where the s is literally the slot
digit: slot $9 lives at 0xF9000000, slot $E at 0xFE000000. (The driver computes it by masking the card's
baseAddr with 0xFF000000.) So 0xFs000000 throughout this article means "the base of whichever slot the card is
in," and every derived address — 0xFs0FE000 for the operand aperture, and so on — carries the same digit. Crucially,
that slot space is only directly addressable when the CPU is in 32-bit addressing mode. On a 24-bit Mac (or a Mac
booted in 24-bit Memory Manager mode), the top byte of every address is stripped, so 0xFs…… is unreachable without a
mode switch.
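
The address arithmetic above is small enough to sketch. This is an illustration only, assuming nothing beyond what the text states (slot digits $9 to $E, the 0xFF000000 mask, and the top byte being ignored in 24-bit mode); all function names are made up for the example.

```python
# Sketch of the slot-address arithmetic described above.
# Function names are illustrative; the constants come from the text.
SLOT_MASK = 0xFF000000

def slot_base(slot: int) -> int:
    """Base of a NuBus slot's 32-bit address space: 0xFs000000."""
    if not 0x9 <= slot <= 0xE:
        raise ValueError("slot must be in $9..$E")
    return (0xF0 | slot) << 24

def slot_base_from_base_addr(base_addr: int) -> int:
    """What the driver does: mask the card's baseAddr with 0xFF000000."""
    return base_addr & SLOT_MASK

def as_seen_in_24_bit_mode(addr: int) -> int:
    """In 24-bit mode the top byte is stripped, so the slot digit is lost."""
    return addr & 0x00FFFFFF

print(hex(slot_base(0x9)))                          # slot $9
print(hex(slot_base(0xE) + 0x0FE000))               # a derived address in slot $E
print(hex(as_seen_in_24_bit_mode(0xF90FE000)))      # no longer a slot address
```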
It's worth being precise about what that costs, and when. Installing the accelerator needs no 32-bit mode at all — the
trap-table patches are ordinary low-RAM writes — so the cdev loads and installs on a 24-bit boot just fine. The only
part of setup that has to reach the card is a one-time register probe, and it gets there by temporarily flipping to
32-bit with _SwapMMUMode, then flipping straight back:
$0122 MOVEQ #1,D0 / _SwapMMUMode ; enter 32-bit mode
$0126 MOVE.W $D00402(A0),D4 ; read STATUS
$012E MOVE.B $D40402(A0),D3 ; read CONFIG
$0136 _SwapMMUMode ; restore prior mode

That bracket is not a requirement that the machine be running in 32-bit mode; it is simply how you reach a NuBus
card's 32-bit slot address from a 24-bit context, so the probe (like the install) runs regardless of the boot mode. The
per-blit path is the opposite trade-off: the engine can only be driven when the aperture is already reachable, and the
hook can't afford a _SwapMMUMode on every scanline of every blit. So it does something cheaper: it checks the
low-memory MMU32bit flag ($0CB2) and, if the machine isn't already in 32-bit mode, it simply doesn't accelerate:
$01E0 TST.B $0CB2 ; MMU32bit — already 32-bit?
$01E4 BEQ $0354 ; no → can't reach the aperture → chain to ROM

This is the first of many eligibility gates we'll see. The card only accelerates when acceleration is free of overhead; otherwise it steps aside. (It also means that on a stock 24-bit System 6 machine the engine is essentially dormant and QuickDraw uses the CPU, a fact that shapes how much the card ever helped in practice.)
The card exposes two register longwords high in slot space, of which the driver uses three bytes:
| Slot-relative address | Access | Role |
|---|---|---|
| +$D00402 | read | STATUS: low 3 bits = current pixel-depth code; bit 3 = card-class/VRAM-organisation |
| +$D40402 | read (init only) | CONFIG: bit 0 = geometry variant |
| +$D40403 | write | CONTROL / MODE: latches what the next aperture writes will do |
STATUS[2:0] is read live at the top of every accelerated blit and used to index an 8-entry stride table (the current
rowBytes for the depth the user last picked in the Monitors panel). CONTROL ($D40403) is the command register —
writing to it selects the engine's operating mode:
| Value written to $D40403 | Meaning |
|---|---|
| $01 | pattern / solid fill: replicate the latched operand |
| $03 | stretch / scale |
| $7F | fast block copy (all planes, straight transfer) |
| computed $00–$3F | raster-op / transfer-mode derived from the QuickDraw mode |
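
As a compact restatement of the register map, here is a hedged sketch of the offsets and the STATUS decode. The offsets, bit positions and CONTROL values are the ones given above; the constant and function names are invented for the example, and the contents of the 8-entry stride table are deliberately left out because they are not listed in this article.

```python
# Sketch of the register map described above. Names are illustrative.
STATUS_OFF = 0xD00402    # read
CONFIG_OFF = 0xD40402    # read, init only
CONTROL_OFF = 0xD40403   # write

CONTROL_FILL = 0x01
CONTROL_STRETCH = 0x03
CONTROL_COPY = 0x7F

def register_address(slot_base: int, offset: int) -> int:
    """Absolute 32-bit address of a register for the card in a given slot."""
    return slot_base + offset

def decode_status(status: int) -> dict:
    """Split STATUS into the two fields the driver is described as using."""
    return {
        "depth_code": status & 0x07,        # bits [2:0], indexes the stride table
        "class_bit": (status >> 3) & 0x01,  # bit 3, card class / VRAM organisation
    }

print(hex(register_address(0xF9000000, CONTROL_OFF)))
print(decode_status(0x0B))
```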
Here is the single most important architectural idea, and it's beautifully simple. The card maps its framebuffer twice, four megabytes apart:
- The passive bank at offset 0 is plain VRAM. Reads and writes here are just pixels; this is the baseAddr QuickDraw hands to the hooks.
- The active bank at offset +0x400000 is a hardware-transforming alias of the same cells. A write through this window is not stored verbatim; the engine interprets it according to the mode last latched into $D40403.
passive bank active bank (alias, +0x400000)
0x000000 ─ plain pixels ─┐ 0x400000 ─ writes are TRANSFORMED ─┐
... │ same ... (engine applies the mode) │
0x3FFFFF ┘ VRAM 0x7FFFFF ┘
Plus a small operand aperture near the top of VRAM (0x0FE000 on a small-VRAM card, 0x3FE000 on a large one),
with its own +0x400000 commit alias. That's where you load the pattern/colour the fill engine will replicate.
So the whole engine has just three moving parts: a mode latch ($D40403), an operand register (loaded through
the operand aperture and committed), and a transforming write window (dest + 0x400000). No command FIFO, no DMA
descriptors, no completion interrupt. Everything is driven by plain CPU stores to memory-mapped windows, and — as far as
the driver is concerned — the engine is synchronous: it never polls a "busy" bit before or after an operation.
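
The aliasing is plain address arithmetic, which the sketch below spells out. It assumes only the 4 MB spacing described above; the helper names are hypothetical, and vram_cell is meaningful only for addresses inside one of the two banks.

```python
# Sketch of the passive/active aliasing. Helper names are illustrative.
ACTIVE_OFFSET = 0x400000     # distance between the two mappings
OFFSET_MASK = 0x00FFFFFF     # slot-relative part of a 32-bit slot address

def active_alias(passive_addr: int) -> int:
    """Address of the transforming alias of a passive-bank address."""
    return passive_addr + ACTIVE_OFFSET

def is_active(addr: int) -> bool:
    """True when the slot-relative offset falls in 0x400000..0x7FFFFF."""
    return ACTIVE_OFFSET <= (addr & OFFSET_MASK) < 2 * ACTIVE_OFFSET

def vram_cell(addr: int) -> int:
    """VRAM cell that a passive or active bank address refers to."""
    return addr & (ACTIVE_OFFSET - 1)

dest = 0xF9001000
print(hex(active_alias(dest)))
print(vram_cell(dest) == vram_cell(active_alias(dest)))
```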
QCOD's one-time self-init walks the Slot Manager's device list, finds every slot holding a 24AC, reads that card's
STATUS/CONFIG to pick geometry, and builds a table of per-slot aperture base addresses at QCOD+0x68:
$00D4 MOVEQ #8,D1 ; slot loop 9..E
loop: ADDQ #1,D1 / BTST D1,D2 ; D2 = bitmap of slots holding the card
$0116 MOVE.L $2A(A1),D0 ; card baseAddr from the slot device record
$011A ANDI.L #$FF000000,D0 ; → slot base 0xFs000000
... read STATUS/CONFIG, pick geometry constants ...
$01A8 MOVE.L #$F0,D0 / OR.W D1,D0 / ROR.L #8,D0 ; build 0xFs000000
$01B2 OR.L D0,D4 ; + the operand-aperture offset
$01B4 MOVE.L D4,$0(A2,D1.W*4) ; store per-slot aperture base

The result: QCOD+0x68[slot] holds 0xFs0FE000 (or 0xFs3FE000). Alongside it, QCOD+0x00 holds a capability
bitmap — one bit per slot that has a 24AC — so that every hook can, in two instructions, decide "is this blit even
going to one of my cards?"
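
To make the two tables concrete, here is a rough model of what the self-init produces. It is a sketch under assumptions: the bit position in the capability bitmap is taken to be the slot number (the listing above tests the bitmap with the slot counter, but the exact indexing used by the hooks is not pinned down here), and the input format is invented for the example.

```python
# Sketch of the self-init result. Assumption: capability bit index == slot number.
# The 'cards' input format (slot -> has large VRAM) is hypothetical.
SMALL_APERTURE = 0x0FE000
LARGE_APERTURE = 0x3FE000

def build_tables(cards: dict) -> tuple:
    """Return (capability_bitmap, aperture_table) for slots $9..$E."""
    bitmap = 0
    table = {}
    for slot in range(0x9, 0xF):
        if slot in cards:
            bitmap |= 1 << slot
            offset = LARGE_APERTURE if cards[slot] else SMALL_APERTURE
            table[slot] = ((0xF0 | slot) << 24) | offset
        else:
            table[slot] = 0   # zero means "no card here"; hooks bail on it
    return bitmap, table

bitmap, table = build_tables({0x9: False, 0xD: True})
print(bin(bitmap), {hex(s): hex(a) for s, a in table.items()})
```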
The engine speaks a tiny language of memory writes. There are exactly three verbs: load an operand, fill, and copy. Everything the card accelerates is built from these.
Before a pattern or solid fill, you load the 32-bit operand (the pattern longword or fill colour) through the operand
aperture and commit it with a magic write of 4 through the aperture's own +0x400000 window — done twice,
presumably for timing/handshake:
write CONTROL($D40403) = $01 ; fill mode
write LONG [APER] = OPERAND ; the pattern long / fill colour
write LONG [APER + 0x400000] = 4 ; commit (driver writes it TWICE)
write LONG [APER + 0x400000] = 4
; thereafter reading [APER] returns OPERAND — the driver reads it back for edges
The driver keeps a one-entry cache: CMP.L (APER),newOperand / BEQ skip. If the pattern longword hasn't changed since
the last fill, it skips the reload entirely. Small, but it matters — most fills in a row use the same colour.
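
The load-and-commit sequence, including the one-entry cache, can be modelled as a list of bus writes. This is a sketch, not the driver: the function name, the tuple layout and the idea of passing in the currently latched operand are all assumptions made for illustration.

```python
# Sketch of the operand load. Each write is (address, size_in_bytes, value).
# load_operand() and its parameters are hypothetical names.
ACTIVE_OFFSET = 0x400000
CONTROL_OFF = 0xD40403
CONTROL_FILL = 0x01
COMMIT_VALUE = 4

def load_operand(slot_base: int, aperture: int, operand: int, latched) -> list:
    """Return the writes needed to latch 'operand'; none if already latched."""
    if latched == operand:
        return []                                   # one-entry cache hit
    commit = aperture + ACTIVE_OFFSET
    return [
        (slot_base + CONTROL_OFF, 1, CONTROL_FILL), # fill mode
        (aperture, 4, operand),                     # pattern long / fill colour
        (commit, 4, COMMIT_VALUE),                  # commit, written twice
        (commit, 4, COMMIT_VALUE),
    ]

writes = load_operand(0xF9000000, 0xF90FE000, 0xFFFFFFFF, latched=None)
for address, size, value in writes:
    print(f"write {size}B [{address:08X}] = {value:X}")
```

A second call with latched=0xFFFFFFFF returns an empty list, which is the case the driver's CMP.L / BEQ pair is there to catch.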
With an operand latched, the body of each scanline is filled by writing run lengths into the active bank:
A = destPixelAddr + 0x400000 ; active-bank alias of the dest position
loop over the middle of the scanline:
L = min(bytesRemaining, stripeWidth) ; clamp to the engine's stripe limit
write LONG [A] = $40000000 | L ; engine fills L bytes here with the operand
A += L
bytesRemaining -= L
Note the value written: $40000000 | L. This is a subtle and load-bearing detail. Bit 30 is a command/enable
flag that sits above the run-length counter; the actual byte count is the low bits. Take the whole longword as the
count and the fill runs to the end of VRAM — a spectacular full-screen smear (and, because a pull-down menu's fills are
narrow and bottom-up, exactly the kind of mistake that surfaces as white-and-black bands across the desktop rather than
an obvious crash). The count is the written value with bit 30 masked off (written & ~$40000000); nothing more.
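
The fill loop and the flag/count split can be sketched as follows. The stripe limit is not given a numeric value in this article, so stripe_width is a parameter and the 512 used in the example is purely a placeholder; the function names are illustrative.

```python
# Sketch of the fill verb. stripe_width is a hypothetical parameter.
FILL_FLAG = 0x40000000       # bit 30: command/enable flag
ACTIVE_OFFSET = 0x400000

def fill_writes(dest_addr: int, byte_count: int, stripe_width: int) -> list:
    """Return the (address, value) longword writes that fill byte_count bytes."""
    writes = []
    address = dest_addr + ACTIVE_OFFSET
    remaining = byte_count
    while remaining > 0:
        run = min(remaining, stripe_width)
        writes.append((address, FILL_FLAG | run))
        address += run
        remaining -= run
    return writes

def decode_run(value: int) -> tuple:
    """Split a written longword into (flag_set, byte_count)."""
    return bool(value & FILL_FLAG), value & ~FILL_FLAG

for address, value in fill_writes(0xF9000000, 640, stripe_width=512):
    print(f"[{address:08X}] = {value:08X} -> {decode_run(value)}")
```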
The copy path (CONTROL=$7F, used by ScrollRect and window CopyBits) is the most interesting, and the place where a
naïve model goes wrong twice.
It is not "stream the source pixels through the window." It is a two-write handshake per stripe:
write CONTROL($D40403) = $7F
for each scanline stripe:
write LONG [srcPos + 0x400000] = L ; no flag → LATCH source = srcPos, len = L
write LONG [dstPos + 0x400000] = $40000000 | L ; flag → COPY L bytes srcPos → dstPos
The flagless write latches a source position and length (it stores nothing); the bit-30-flagged write executes the copy from the latched source to this destination — a source-active store followed by a dest-active store:
MOVE.L D4,(A4) ; write to src+0x400000, no flag → latch source
MOVE.L D5,(A5) ; write to dst+0x400000, flag set → copy

And there's a second subtlety worth pinning down: the latched source is an auto-incrementing pointer. After each
execute it advances by the number of bytes copied, so a single latch can feed several executes. The driver exploits
this whenever a copy's destination would straddle an 8 KB VRAM boundary (dst & 0x1FFF): it splits the copy at the
boundary into two executes from one latch, supplying only the second destination and letting the engine continue the
source itself:
LATCH src=$a788 len=512
COPY dst=$7f88 len=120 ; 0x7f88 + 120 = 0x8000 (crosses the 8 KB line)
COPY dst=$8000 len=392 ; SAME latch — source auto-advances 0xa788 → 0xa800
Treat the source as pinned to the latched value and the second execute re-reads the first 120 bytes of the source: you
get a faint, regular dotting along every boundary-crossing scanline — invisible on a page-jump scroll (whose copies
rarely straddle a boundary) but glaringly obvious when a text window is dragged down one line at a time. The source
pointer must advance by L after every execute.
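
A toy model makes the latch, execute and auto-advance behaviour easier to reason about. The sketch below works on VRAM-relative offsets in a bytearray and reproduces the boundary-crossing example above; the class and function names are invented, and overlap behaviour between source and destination is not modelled.

```python
# Toy model of the $7F copy handshake. Names are illustrative.
FLAG = 0x40000000
BOUNDARY = 0x2000            # 8 KB

class CopyEngine:
    def __init__(self, vram: bytearray):
        self.vram = vram
        self.src = 0

    def write(self, offset: int, value: int) -> None:
        """A longword write through the active bank at a VRAM-relative offset."""
        if value & FLAG:
            count = value & ~FLAG
            self.vram[offset:offset + count] = self.vram[self.src:self.src + count]
            self.src += count            # source auto-advances after each execute
        else:
            self.src = offset            # flagless write: latch the source only

def copy_stripe(engine: CopyEngine, src: int, dst: int, length: int) -> list:
    """One latch, then executes split wherever dst would cross an 8 KB line."""
    log = [("LATCH", src, length)]
    engine.write(src, length)
    while length > 0:
        run = min(length, BOUNDARY - (dst & (BOUNDARY - 1)))
        engine.write(dst, FLAG | run)
        log.append(("COPY", dst, run))
        dst += run
        length -= run
    return log

vram = bytearray(0x10000)
vram[0xA788:0xA988] = bytes(i % 251 for i in range(512))
for step in copy_stripe(CopyEngine(vram), 0xA788, 0x7F88, 512):
    print(step[0], hex(step[1]), step[2])
```

Replacing `self.src += count` with a no-op is a quick way to reproduce the pinned-source bug in the model and compare the destination bytes against the source.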
The SetUpStretch hook latches CONTROL=$03 once at the start of a scaled blit; the per-output-row work is then driven
by a software digital differential analyser (stScanLoop) that calls the fill/copy verbs above for each row. No new
engine registers — stretch is just fill/copy under a scaling loop.
That's the entire hardware interface: a mode latch, an operand register, and two transforming write verbs (fill and
copy), each addressed through a +0x400000 window. Now let's see how the QuickDraw hooks use it.
The thirteen hooks split cleanly across the two code resources by strategy:
| Resource | Hooks | Strategy |
|---|---|---|
| QDPA 0/1/2 | bMAIN0, bEND0, rMASK0 (1-bit), stScanLoop | pure software: tighter 68K loops, no card registers touched |
| QCOD 0 | bSETUP8, bXMAIN8, bSetup0, bLeft0, rMASK8, rXMASK8, slXMASK8, FastSlabMode, SetUpStretch, (rMASK0 hw) | hardware-assisted: each programs $D40403 and streams through +0x400000 |
The split is mechanical: QDPA 0/1/2 contain zero card-register accesses; every QCOD body reads $D00402 and
writes $D40403.
Every QCOD (hardware) body has the same skeleton:
1. ELIGIBILITY GATE → on ANY failure: BRA reject → chain to the routine we displaced
• destination is one of our cards (top byte of dest vs the capability bitmap)
• width D2 (LONGCNT) ≥ 0x20 (≥ 32 longs — big enough to be worth offloading)
• MMU32bit ($0CB2) set (32-bit mode — aperture reachable)
• per-slot aperture base ≠ 0 (card actually present)
• dest row-bump longword-aligned (A3 & 3 == 0)
2. DERIVE slot base = destBase & 0xFF000000 ; index the QCOD+0x68 table
3. PROGRAM the engine: read STATUS → stride; write CONTROL; load the operand
4. STREAM the mask-free middle through +0x400000; do masked EDGES in software
5. advance per scanline (via the BitBlt A6 frame vars), loop, RTS
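
Step 1 of the skeleton, the eligibility gate, can be written out as a single predicate. This is a sketch under assumptions: the capability bit is indexed by slot number, the aperture table is a mapping from slot to base address, and every parameter name is invented for the example.

```python
# Sketch of the eligibility gate. Assumption: capability bit index == slot number.
MIN_LONGS = 0x20             # blits narrower than 32 longs are not offloaded

def eligible(dest_addr, long_count, row_bump, mmu32bit,
             capability_bitmap, aperture_table) -> bool:
    """True when a blit may take the hardware path; False means chain onward."""
    top = (dest_addr >> 24) & 0xFF
    if not 0xF9 <= top <= 0xFE:
        return False                       # destination is not in slot space
    slot = top & 0x0F
    if not (capability_bitmap >> slot) & 1:
        return False                       # no 24AC in that slot
    if long_count < MIN_LONGS:
        return False                       # too small to be worth the setup
    if not mmu32bit:
        return False                       # aperture unreachable in 24-bit mode
    if aperture_table.get(slot, 0) == 0:
        return False                       # card not actually present
    if row_bump & 3:
        return False                       # dest row bump not longword-aligned
    return True

table = {0x9: 0xF90FE000}
print(eligible(0xF9001000, 64, 1024, True, 1 << 0x9, table))
print(eligible(0xF9001000, 16, 1024, True, 1 << 0x9, table))
```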
The QDPA (software) bodies skip steps 2–4 and just run a tight CPU loop. Let's look at both kinds.
A small surprise: QuickDraw's generic copy bottleneck — the basic longword loop bMAIN0/bEND0 — is left as a
pure-software body in QDPA 2, touching no card registers. That is not because copies can't be accelerated — the
$7F mode is a hardware straight-copy, and the card does route the alignment-friendly cases to it (window
ScrollRect through the region path, §5.4, and the 68040 fast-copy bSetup0/bLeft0, §5.3). It's that this generic
hook must also cover the general case — arbitrary bit-shifted, unaligned, possibly off-card-source copies — which
the alignment-restricted hardware path can't take, so it's left as an optimised CPU loop. The card's contribution here
is just a better loop.
Compare the ROM's inner loop with the patch. The ROM's basic-copy loop bounces back out to BitBlt's shared NXTSRC
between scanlines:
bMAIN0A BFEXTU (A4){D6:0},D0 ; read 32 shifted source bits
ADD A0,A4 / EOR.L D7,D0 / AND.L D1,D0 / NOT.L D1
AND.L (A5),D1 / OR.L D1,D0 / MOVE.L D0,(A5) / ADD A0,A5
SUB #1,D2 / BEQ bEND0A / BLT NXTSRC ; NXTSRC lives out in the BitBlt body

The patch internalises the entire scanline loop (inner and outer) so it never pays the per-scanline re-dispatch back into BitBlt:
; --- the patch's OWN per-scanline loop (QDPA 2) ---
$007C ADDA.W -$25E(A6),A4 ; A4 += SRCBUMP (frame var)
$0080 ADDA.W A3,A5 ; A5 += dest row bump
$0082 SUBQ.W #1,-$260(A6) ; HEIGHT--
$0086 BEQ.S $0092 ; done → RTS
$0088 MOVE.L -$26C(A6),D1 ; reload FIRSTMASK
$008C MOVE.W -$26E(A6),D2 ; reload LONGCNT
$0090 JMP (A1) ; straight back into the main loop — no re-dispatch
$0092 RTS

The only "hardware" involved is that A5 already points at fast card VRAM. The win is real but modest: fewer branches,
no per-row trip through the dispatcher. It is emblematic of the whole card's philosophy — the CPU still walks every
pixel; you just make it walk more efficiently and, where you can, hand the per-pixel work to silicon.
(Note also what the patch isn't: it's a plain 68020-class BFEXTU/MOVE.L loop, not a 68040 MOVE16 block copy. The
card ships no 68040-tuned software blitter of its own — on a 040, the fast-copy hooks fall back to the ROM's own
MOVE16 path.)
Now the interesting half. bSETUP8 is QuickDraw's patterned fill for 8-bit and deeper pixels. The ROM version
expands the 8×8 pattern to 16 longwords and, per destination longword, does a read-modify-write to OR the masked pattern
in — pure CPU, one longword at a time.
The patch replaces that RMW loop with hardware fill. Walking the body (QCOD+0x01C2):
; ---- eligibility gate (any fail → $0354 chain-to-ROM) ----
$01C2 MOVE.L -$F4(A6),D0 / ROL.L #8 / SUBI.W #$F8,D0 / BMI $0354 ; dest slot in range?
$01D0 BTST D0,(QCOD+0) / BEQ $0354 ; slot in the capability bitmap?
$01D8 CMPI.W #$20,D2 / BLT $0354 ; width ≥ 32 longs?
$01E0 TST.B $0CB2 / BEQ $0354 ; MMU32bit set?
; ---- derive this slot's aperture base ----
$01EE MOVE.L A5,D3 / ROL.L #8 / ANDI.L #$FF,D3 / SUBI.W #$F0,D3 ; slot index from dest
$0202 MOVE.L $0(A0,D3.W*4),D3 / MOVE.L D3,(A0) / BEQ $0354 ; base; bail if 0
$0212 MOVE.L A3,D3 / ANDI.W #3,D3 / BNE $0354 ; dest row-bump long-aligned?
; ---- per pattern longword: load the pattern into the card, then let it fill ----
$027C MOVE.L (A4)+,D6 ; next pattern long
$027E CMP.L (A6),D6 / BEQ $02AA ; unchanged? skip reprogramming (1-entry cache)
$028C MOVE.B #$1,$D40403(,D1.L*1) ; latch CONTROL = fill
$0296 MOVE.L D6,(A6) ; write pattern long to the operand aperture
$0298 ADDA.L #$400000,A6 ; → the commit window
$029E MOVEQ #4,D1 / MOVE.L D1,(A6) / MOVE.L D1,(A6) ; commit (twice)
; ---- left edge in software (masked), then the middle through the accel aperture ----
$02AA MOVE.L D6,D0 / AND.L D4,D0 / AND.L (A5),D1 / OR.L D1,D0 / MOVE.L D0,(A5)+ ; masked edge long
$02D2 ADDA.L #$400000,A5 ; switch dest to the active bank
... stream the scanline middle: the card replicates the loaded pattern ...

The shape is exactly the §4 skeleton: latch fill mode, load the pattern, commit, then write the mask-free middle of
the scanline through dest + 0x400000 where silicon replicates the pattern into VRAM. The ragged left/right edges —
the partial longwords that need per-pixel masking — are still done by the CPU through the passive bank, because the
engine fills whole units, not masked fractions.
bXMAIN8 is the same skeleton for the XOR / transfer-mode case. It computes a raster-op code from the QuickDraw
mode word and writes that to $D40403 instead of $01:
$107A MOVE.B $D00402(,D3.L*1),D3 / ANDI.W #7,D3 / MOVE.W tbl(PC,D3*2),D3 ; STATUS → stride
$10AA MOVE.B D0,$D40403(,D3.L*1) ; write the COMPUTED transfer mode
$10F4 ADDA.L #$400000,A0 ; second aperture
$10FE MOVE.L $0(A4,D3.W*1),(A2)+ / DBF D1,… ; stream the run

The code is derived from the QuickDraw mode word D7 by if (D7 & 4) code = D7>>2; else code = ((D7<<1)+4)>>2. In other words,
QuickDraw's transfer mode is mapped onto the engine's raster-op selector.
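
The mapping is a two-branch integer expression, transcribed here directly from the formula above. The function name is illustrative, and the sketch makes no claim about what each resulting code means to the engine.

```python
# Direct transcription of the D7 -> CONTROL code formula quoted above.
def rop_code(mode_word: int) -> int:
    """Map a QuickDraw mode word (D7) to the value written to $D40403."""
    if mode_word & 4:
        return mode_word >> 2
    return ((mode_word << 1) + 4) >> 2

for d7 in (0, 4, 8, 12):
    print(f"D7={d7:2d} -> code ${rop_code(d7):02X}")
```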
These are the hooks that displace the ROM's 68040 MOVE16 fast copy. The patch carries the same eligibility gate, then
puts the card into $7F fast-block mode and runs the aligned transfer through the active bank:
$05B6 MOVE.B #$7F,$D40403(,D1.L*1) ; latch fast-block copy
$0612 ADDA.L #$400000,A5 ; stream through the active aperture

Anything not aligned, not 32+ longs wide, or not headed to a card falls through to the ROM's own 040 path.
These come from RgnBlt: blits clipped to an arbitrary region, where each destination scanline is described by a
run-encoded instruction stream (skip / count / mask-flag words). The patches keep the ROM's run-decode logic
verbatim and swap only the inner fill: a solid masked run streams through the active bank (hardware fill); a fragmented
run falls back to software.
rMASK0 (the 1-bit case) is the one genuinely subtle hook, because it is installed twice, and the order matters:
- First, the software pass patches the QDPA 1 body (a self-contained region-masked 1-bit fill, no card access) directly onto trap $AB5A, saving no chain-back pointer; it's terminal.
- Then the hardware pass patches the QCOD rMASK0 body on top, and does capture the displaced handler first, which is now the QDPA 1 software body.
; software pass (installer):
$021E _GetResource('QDPA',1) / _DetachResource / _HLock / _StripAddress → A0
$0242 MOVE.W #$AB5A,D0 / _SetToolTrapAddress ; rMASK0 ← QDPA 1 (software)
; hardware pass then layers QCOD's rMASK0 on top, saving QDPA 1 as its fallback.

The upshot: rMASK0 is the only hook that degrades hardware → optimised software (with no ROM tier beneath
it). Every other QCOD hook degrades hardware → ROM. It's a small masterpiece of trap-table stacking, and easy to miss.
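
The double install is easier to see in a small model of the trap table. The sketch below uses a dictionary as the table and closures as handlers; all names are hypothetical and the "eligible" flag stands in for the whole eligibility gate.

```python
# Toy model of the rMASK0 trap stacking. Handlers and names are illustrative.
RMASK0 = 0xAB5A

def rom_rmask0(blit):
    return "ROM"

def qdpa1_rmask0(blit):
    return "QDPA 1 software"

def make_qcod_rmask0(chain):
    """Build the hardware body with a saved chain-back to the displaced handler."""
    def qcod_rmask0(blit):
        if blit.get("eligible"):
            return "QCOD hardware"
        return chain(blit)
    return qcod_rmask0

def install(table: dict) -> None:
    table[RMASK0] = qdpa1_rmask0          # software pass: old handler not saved
    displaced = table[RMASK0]             # hardware pass: capture what is there now
    table[RMASK0] = make_qcod_rmask0(displaced)

table = {RMASK0: rom_rmask0}
install(table)
print(table[RMASK0]({"eligible": True}))
print(table[RMASK0]({"eligible": False}))
```

Because the software pass discards the ROM handler, nothing in the chain can reach it afterwards, which is the "no ROM tier" point made above.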
Scaling is split three ways:
- FastSlabMode (hardware): picks a solid-fill scanline routine for the line/polygon path and programs the card so the chosen "slab" (a horizontal solid run) is filled in silicon. Solid fills are the ideal hardware case.
- SetUpStretch (hardware, System 7.0+ only): latches CONTROL=$03 once at the start of a scaled blit and sets up scaling state. Its trap-patch table entry lives past the main list's terminator and is reached only through a version-gated secondary dispatch: on System 6 it isn't installed and StretchBits uses the ROM setup (the per-row fills still accelerate).
- stScanLoop (pure software): the per-output-row DDA that steps source and destination by the scaling ratio and calls the horizontal-scale and mode-case routines (which may be the hardware bodies above):
$0004 CMP.W -$208(A6),D0 ; compare to the denominator
$000E DIVU.W -$208(A6),D1 ; scaling division
$003C ADD.L D0,-$1FC(A6) ; advance the source address by SRCROW
... add denom to the error term, step source while error ≤ 0, emit one dst row ...

So the stretch path is a software orchestrator fanning out to hardware fills: the CPU walks the scaling geometry, the card does the pixel work.
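
For readers unfamiliar with the technique, here is a generic integer DDA for vertical scaling. It is a textbook formulation, not a transcription of stScanLoop: the sign convention of the error term and the variable names are assumptions, and it only shows which source row each destination row would draw from.

```python
# Generic row-stepping DDA (textbook form, not the driver's exact loop).
def stretch_rows(src_rows: int, dst_rows: int) -> list:
    """Return, for each destination row, the index of the source row to use."""
    mapping = []
    src = 0
    error = 0
    for _ in range(dst_rows):
        mapping.append(src)          # emit one destination row from 'src'
        error += src_rows
        while error >= dst_rows:     # step the source as the error overflows
            error -= dst_rows
            src += 1
    return mapping

print(stretch_rows(2, 4))   # enlarging: source rows repeat
print(stretch_rows(4, 2))   # shrinking: source rows are skipped
```

In the card's case the per-row body would be whichever fill or copy routine is installed, so each emitted row can still go through the hardware path.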
It's tempting to be underwhelmed. The 24AC is not an autonomous coprocessor. There is no display list, no command queue, no "draw this polygon" that runs while the CPU does something else. The 68K still executes the entire outer structure of every blit — it walks the scanlines, computes the masks, decodes the region runs, runs the stretch DDA. So where does the speedup come from?
1. The per-pixel inner loop is the expensive part, and that's what moves to silicon. Think about a patterned fill in
software. For every destination longword the CPU must: load the pattern, load the destination, mask, combine, store —
call it 5–6 instructions per 4 bytes, each store a round trip to VRAM. A hardware fill collapses that to one write per
stripe: write [dst+0x400000] = flag|L and the card paints L bytes itself. A 640-byte scanline that was ~160
read-modify-write longword iterations becomes a couple of stripe writes. The CPU's job shrank from O(pixels) to
O(stripes).
2. Writes go to fast VRAM through the card's own datapath, not the CPU's. Even the pure-software bMAIN0 copy
benefits simply because A5 points at card VRAM and the loop is tighter. When the engine fills, the card's memory
controller does the burst into VRAM — the CPU issues one command write and is done, rather than pushing every pixel
across the NuBus one longword at a time.
3. The setup cost is amortised, and gated. Programming the engine (latch mode, load pattern, commit) costs a handful of writes. That only pays off if the blit is big enough — which is exactly why the gate rejects anything under 32 longs and sends it to software. The card is smart about when not to bother. A pure-software card can't make that trade because it has nothing to trade against; the 24AC picks the cheaper of two real options per blit.
4. Pattern/colour caching removes redundant reloads. The one-entry operand cache (CMP.L (APER),new / BEQ skip)
means a run of same-coloured fills — the common case for desktop and window backgrounds — reloads the engine once, not
per fill.
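
The O(pixels) versus O(stripes) argument in point 1 is simple arithmetic, sketched below. The stripe width of 512 bytes is a placeholder (the article does not give the real limit), the function names are illustrative, and the count ignores the software-masked edges and the one-off setup writes.

```python
# Back-of-envelope count of CPU operations per scanline. stripe_width is hypothetical.
import math

LONG_BYTES = 4

def software_iterations(scanline_bytes: int) -> int:
    """Read-modify-write loop iterations: one per destination longword."""
    return math.ceil(scanline_bytes / LONG_BYTES)

def hardware_writes(scanline_bytes: int, stripe_width: int) -> int:
    """Command writes needed when the engine fills one stripe per write."""
    return math.ceil(scanline_bytes / stripe_width)

width = 640
print(software_iterations(width), "software iterations")
print(hardware_writes(width, stripe_width=512), "stripe writes")
```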
Now the honest bounding. This design cannot help with:
- Small blits: the setup cost dominates, so they're rejected to software. Selection XOR, text carets, tiny UI rectangles: all CPU.
- Unaligned or bit-shifted copies: the general bMAIN0/bEND0 path stays software (just a tighter loop); only aligned card-to-card copies reach the $7F hardware block-copy (ScrollRect, the 040 fast-copy).
- Ragged edges: the partial-longword left/right fringes of every fill are masked in software; only the mask-free interior is hardware. A tall, thin fill is nearly all edge and barely accelerates.
- 24-bit mode: no aperture access, so no acceleration at all.
- Geometry: region decoding, stretch DDA, clipping: all still on the CPU.
Pulling it together:
The Apple Macintosh Display Card 24AC is a smart framebuffer: a 24-bit colour NuBus card whose VRAM is mapped twice — once as plain pixels, once as a hardware-transforming alias — with a three-register engine (mode latch, operand, transforming write window) that fills and copies runs of pixels on command. Its control panel patches thirteen of QuickDraw's dispatchable bottleneck hooks: the colour fill/pattern/XOR/region/stretch hooks drive the engine; the basic copy and the stretch DDA are hand-optimised software; and every accelerated path is gated and degrades transparently to the code it replaced. It is not a coprocessor — the CPU drives every scanline — but by moving the per-pixel raster-op and pattern replication into silicon behind a memory-mapped window, it wins most of the available speedup at a fraction of a coprocessor's cost, and it can never draw a wrong pixel.
| Hook (trap) | Body | Resource | Hardware? | Degrades to |
|---|---|---|---|---|
| bMAIN0 ($AB30) | QDPA 2 +6 | QDPA | no (SW) | terminal |
| bEND0 ($AB40) | QDPA 2 +0 | QDPA | no (SW) | terminal |
| stScanLoop ($AB99) | QDPA 0 +0 | QDPA | no (SW) | terminal |
| bSETUP8 ($AB34) | QCOD 0 +$01C2 | QCOD | yes | ROM |
| bXMAIN8 ($AB38) | QCOD 0 +$1010 | QCOD | yes | ROM |
| bSetup0 ($AB58) | QCOD 0 +$052A | QCOD | yes | ROM (040 MOVE16) |
| bLeft0 ($AB59) | QCOD 0 +$06B0 | QCOD | yes | ROM |
| rMASK0 ($AB5A) | QCOD 0 +$084C over QDPA 1 | QCOD→QDPA | yes | software (QDPA 1) |
| rMASK8 ($AB5E) | QCOD 0 +$035A | QCOD | yes | ROM |
| rXMASK8 ($AB62) | QCOD 0 +$0BAA | QCOD | yes | ROM |
| slXMASK8 ($AB84) | QCOD 0 +$0DEA | QCOD | yes | ROM |
| FastSlabMode ($AB0C) | QCOD 0 +$11F4 | QCOD | yes | ROM |
| SetUpStretch ($AB24) | QCOD 0 +$1432 | QCOD | yes (Sys 7.0+) | ROM |
Capability bitmap: QCOD+0x00 (BTST slot-index)
Per-slot aperture table: QCOD+0x68[slot] = 0xFs0FE000 | 0xFs3FE000
STATUS (read): slotBase + 0xD00402 ([2:0] depth → stride table; [3] class)
CONFIG (read, init): slotBase + 0xD40402 ([0] geometry variant)
CONTROL (write): slotBase + 0xD40403 (01=fill 03=stretch 7F=copy 00..3F=ROP)
Passive framebuffer: slotBase + 0x000000 (plain pixels; QuickDraw baseAddr)
Active (engine) bank: dest + 0x400000 (writes are transformed by CONTROL)
Operand aperture: slotBase + 0x0FE000|0x3FE000 (load operand; +0x400000 write 4 = commit)
32-bit-mode gate: lowmem MMU32bit ($0CB2) must be non-zero
Fill : write [dst+0x400000] = 0x40000000 | L ; fills L bytes with the operand
Copy : write [src+0x400000] = L ; latch source (no store)
write [dst+0x400000] = 0x40000000 | L ; copy L bytes; source auto-advances by L