Inside the Apple Macintosh Display Card 24AC

1. Introduction

The Apple Macintosh Display Card 24AC is a 24-bit colour NuBus display card, manufactured by Radius and white-labelled by Apple, whose headline feature is a QuickDraw accelerator. Note that Apple's Tech Info Library says "QuickDraw Acceleration"; it doesn't explicitly say hardware acceleration, which is a slight hint of things to come.

There is, as far as I've been able to find, no real documentation of the card's hardware available on the internet, so the main way to learn about it is to study the software written for the card. That software boils down to two pieces:

NuBus declaration ROM. Every NuBus card carries one. On this card it holds the ordinary video driver — the code the Slot Manager and the boot ROM use to bring the card up as a screen: monitor sensing, mode/depth selection, the CLUT/RAMDAC, VBL interrupts, and the linear framebuffer. Nothing out of the ordinary, and really no clues regarding the "acceleration" itself.

An INIT/cdev (the control panel). This is loaded when the System is started, and it is the entire subject of this article — the acceleration lives here. In this case I've used the Apple-branded INIT/cdev, but I strongly believe that the Radius-branded software is very, very similar.

Inside the INIT/cdev, there are basically three types of resources that seem to hold code — the expected INIT, as well as two code resource types I haven't really encountered before:

Resource	Role
`INIT 0/1/2`	the installer — startup gating, card discovery, trap-table patching, boot splash for the init
`QCOD 0`	the hardware-assisted engine — self-init, a per-slot aperture table, the trap-patch table, and ten hardware "hook" bodies
`QDPA 0/1/2`	the pure-software bodies — a stretch DDA, a region-masked 1-bit fill, and the basic copy loop

So from our perspective in this text, the two code resources that matter are QCOD and QDPA. Between them they carry thirteen replacement routines for QuickDraw's innermost drawing loops. That is the whole accelerator. Before digging into the details, let's briefly look at the QuickDraw architecture.

2. Background: QuickDraw's replaceable bottlenecks

QuickDraw's public calls like CopyBits, FillRect, and so on (as well as internal calls like StretchBits) are not monolithic. Internally they funnel down to a small set of "bottleneck" primitives: the tightest inner loops that actually move and combine pixels. And Apple deliberately dispatched those inner loops through the trap table, the same numbered-vector mechanism used for Toolbox calls, even though these internal primitives are "private" to QuickDraw. The interesting result is of course that they can be replaced/patched like any "public" trap.

The full set of these replaceable hooks is below — the routine, its trap number (its slot in the dispatch table), and the trap word (the instruction that invokes it):

Hook (routine)	Trap #	Trap word	What it does
`FastSlabMode`	`$30C`	`$AB0C`	solid-slab fill (line/polygon path)
`SetUpStretch`	`$324`	`$AB24`	stretch / scale setup
`bMAIN0`	`$330`	`$AB30`	basic copy, main loop
`bSETUP8`	`$334`	`$AB34`	patterned fill, 8 bpp+
`bXMAIN8`	`$338`	`$AB38`	XOR / transfer-mode fill
`bEND0`	`$340`	`$AB40`	basic copy, tail
`bSetup0`	`$358`	`$AB58`	68040 fast copy, setup
`bLeft0`	`$359`	`$AB59`	68040 fast copy, left edge
`rMASK0`	`$35A`	`$AB5A`	region-masked 1-bit fill
`rMASK8`	`$35E`	`$AB5E`	region-masked 8 bpp fill
`rXMASK8`	`$362`	`$AB62`	region-masked XOR fill
`slXMASK8`	`$384`	`$AB84`	scaled masked fill
`stScanLoop`	`$399`	`$AB99`	stretch scanline DDA

As far as I can see, BitBlt, RgnBlt and StretchBits invoke these by trap (_bMAIN0, _rMASK8, …) rather than by a direct JSR. Because a trap is just a table entry, any INIT can call _SetToolTrapAddress and substitute its own routine — and QuickDraw will happily call the replacement.

The beauty here is of course that if we "accelerate" these thirteen routines, we are basically accelerating a number of different higher-level routines like BitBlt, RgnBlt, StretchBits and so on.
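The two numeric columns of the hook table above are related mechanically. A minimal sketch in Python, assuming the usual Toolbox encoding in which the trap word is $A800 OR'd with the trap number (this agrees with every row of the table); the names `trap_word` and `HOOKS` are mine, purely for illustration:

```python
# Toolbox trap words are A-line opcodes with bit 11 set; the low bits
# carry the trap number. The trap numbers are copied from the table above.
TOOLBOX_BASE = 0xA800

HOOKS = {
    "FastSlabMode": 0x30C,
    "SetUpStretch": 0x324,
    "bMAIN0": 0x330,
    "bSETUP8": 0x334,
    "bXMAIN8": 0x338,
    "bEND0": 0x340,
    "bSetup0": 0x358,
    "bLeft0": 0x359,
    "rMASK0": 0x35A,
    "rMASK8": 0x35E,
    "rXMASK8": 0x362,
    "slXMASK8": 0x384,
    "stScanLoop": 0x399,
}

def trap_word(trap_number: int) -> int:
    """Return the instruction word that dispatches the given Toolbox trap."""
    return TOOLBOX_BASE | trap_number

# Print the table's third column from its second.
for name, number in HOOKS.items():
    print(f"{name:13s} ${number:03X} -> ${trap_word(number):04X}")
```

The trap word is what an INIT passes to _SetToolTrapAddress when it swaps in a replacement, which is why the rest of this article refers to hooks by their $ABxx values.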

The names encode a taxonomy worth internalising, because it tells you at a glance what each hook does and which will be worth accelerating:

…0 vs …8 — pixel depth. …0 is the 1-bit / monochrome family; …8 is the chunky 8-bit-and-deeper (colour) family.
b… — the plain rectangular blit (BitBlt).
r… — the region-masked (arbitrary-shape-clipped) variants (RgnBlt), driven by a run-encoded instruction stream.
X — the XOR / "big pattern" (indexed-offset pattern) variant.
sl… — a scaled (stretch) masked variant.

Two consequences fall straight out of this taxonomy, and they define the whole strategy of the card:

The colour (…8) fill/pattern/XOR/region hooks are where a colour card spends its time, so those are the ones worth putting in silicon.
A plain CopyBits of already-formed pixels (bMAIN0/bEND0) is not a fill — there's no pattern to replicate, no raster-op to apply — so, as we'll see, the card doesn't hardware-accelerate it at all.

3. The actual "acceleration" hardware

The engine the control panel drives is simple enough to describe in full: a slot of NuBus address space, three registers, and one clever piece of address aliasing. Here is the whole thing.

3.1 Slot space and the 32-bit-mode gate

The card is an ordinary NuBus card in some slot. Write its slot number as s — a single hex digit in the range $9–$E on these machines. That slot's address space begins at 0xFs000000, where the s is literally the slot digit: slot $9 lives at 0xF9000000, slot $E at 0xFE000000. (The driver computes it by masking the card's baseAddr with 0xFF000000.) So 0xFs000000 throughout this article means "the base of whichever slot the card is in," and every derived address — 0xFs0FE000 for the operand aperture, and so on — carries the same digit. Crucially, that slot space is only directly addressable when the CPU is in 32-bit addressing mode. On a 24-bit Mac (or a Mac booted in 24-bit Memory Manager mode), the top byte of every address is stripped, so 0xFs…… is unreachable without a mode switch.

It's worth being precise about what that costs, and when. Installing the accelerator needs no 32-bit mode at all — the trap-table patches are ordinary low-RAM writes — so the cdev loads and installs on a 24-bit boot just fine. The only part of setup that has to reach the card is a one-time register probe, and it gets there by temporarily flipping to 32-bit with _SwapMMUMode, then flipping straight back:

$0122  MOVEQ #1,D0 / _SwapMMUMode          ; enter 32-bit mode
$0126  MOVE.W $D00402(A0),D4               ; read STATUS
$012E  MOVE.B $D40402(A0),D3               ; read CONFIG
$0136  _SwapMMUMode                        ; restore prior mode

That bracket is not a requirement that the machine be running in 32-bit mode — it is simply how you reach a NuBus card's 32-bit slot address from a 24-bit context, so the probe (like the install) runs regardless of the boot mode. The per-blit path is the opposite trade-off: the engine can only be driven when the aperture is already reachable, and the hook can't afford a _SwapMMUMode on every scanline of every blit. So it does something cheaper: it checks the low-memory MMU32bit flag ($0CB2) and, if the machine isn't already in 32-bit mode, it simply doesn't accelerate:

$01E0  TST.B  $0CB2                         ; MMU32bit — already 32-bit?
$01E4  BEQ    $0354                         ; no → can't reach the aperture → chain to ROM

This is the first of many eligibility gates we'll see. The card only accelerates when acceleration is free of overhead; otherwise it steps aside. (It also means that on a stock 24-bit System 6 machine the engine is essentially dormant and QuickDraw uses the CPU — a fact that shapes how much the card ever helped in practice.)
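The slot addressing used throughout this section is easy to get wrong by one hex digit, so here is a minimal Python sketch of it. The function names are mine, and the 24-bit model (simply dropping the top byte) is a simplification of what the hardware does to an address in 24-bit mode:

```python
SLOT_MASK = 0xFF000000

def slot_base(slot: int) -> int:
    """Base of a NuBus slot's 32-bit address space: 0xFs000000."""
    if not 0x9 <= slot <= 0xE:
        raise ValueError("slot must be in the range $9..$E")
    return 0xF0000000 | (slot << 24)

def slot_base_from_addr(base_addr: int) -> int:
    """What the driver does: mask the card's baseAddr with 0xFF000000."""
    return base_addr & SLOT_MASK

def as_seen_in_24bit_mode(addr: int) -> int:
    """Simplified model of 24-bit mode: the top byte is stripped."""
    return addr & 0x00FFFFFF

# The small-VRAM operand aperture of a card in slot $9.
aperture = slot_base(0x9) + 0x0FE000
print(hex(aperture))                         # 0xf90fe000
print(hex(as_seen_in_24bit_mode(aperture)))  # 0xfe000, no longer in slot space
```

The last line is the whole reason for the MMU32bit gate: once the top byte is gone, the address no longer selects the card at all.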

3.2 The register map

The card exposes two register longwords high in slot space, of which the driver uses three bytes:

Slot-relative address	Access	Role
`+$D00402`	read	STATUS — low 3 bits = current pixel-depth code; bit 3 = card-class/VRAM-organisation
`+$D40402`	read (init only)	CONFIG — bit 0 = geometry variant
`+$D40403`	write	CONTROL / MODE — latches what the next aperture writes will do

STATUS[2:0] is read live at the top of every accelerated blit and used to index an 8-entry stride table (the current rowBytes for the depth the user last picked in the Monitors panel). CONTROL ($D40403) is the command register — writing to it selects the engine's operating mode:

Value written to `$D40403`	Meaning
`$01`	pattern / solid fill — replicate the latched operand
`$03`	stretch / scale
`$7F`	fast block copy (all planes, straight transfer)
computed `$00`–`$3F`	raster-op / transfer-mode derived from the QuickDraw mode

3.3 The dual-aperture trick — the heart of the design

Here is the single most important architectural idea, and it's beautifully simple. The card maps its framebuffer twice, four megabytes apart:

The passive bank at offset 0 is plain VRAM. Reads and writes here are just pixels — this is the baseAddr QuickDraw hands to the hooks.
The active bank at offset +0x400000 is a hardware-transforming alias of the same cells. A write through this window is not stored verbatim; the engine interprets it according to the mode last latched into $D40403.

   passive bank                          active bank (alias, +0x400000)
   0x000000  ─ plain pixels ─┐           0x400000  ─ writes are TRANSFORMED ─┐
             ...             │  same        ...     (engine applies the mode) │
   0x3FFFFF                  ┘  VRAM      0x7FFFFF                            ┘

Plus a small operand aperture near the top of VRAM (0x0FE000 on a small-VRAM card, 0x3FE000 on a large one), with its own +0x400000 commit alias. That's where you load the pattern/colour the fill engine will replicate.

So the whole engine has just three moving parts: a mode latch ($D40403), an operand register (loaded through the operand aperture and committed), and a transforming write window (dest + 0x400000). No command FIFO, no DMA descriptors, no completion interrupt. Everything is driven by plain CPU stores to memory-mapped windows, and — as far as the driver is concerned — the engine is synchronous: it never polls a "busy" bit before or after an operation.

3.4 Finding the cards: the per-slot aperture table

QCOD's one-time self-init walks the Slot Manager's device list, finds every slot holding a 24AC, reads that card's STATUS/CONFIG to pick geometry, and builds a table of per-slot aperture base addresses at QCOD+0x68:

$00D4  MOVEQ  #8,D1                         ; slot loop 9..E
loop:  ADDQ #1,D1 / BTST D1,D2              ; D2 = bitmap of slots holding the card
$0116  MOVE.L $2A(A1),D0                    ; card baseAddr from the slot device record
$011A  ANDI.L #$FF000000,D0                 ; → slot base 0xFs000000
       ... read STATUS/CONFIG, pick geometry constants ...
$01A8  MOVE.L #$F0,D0 / OR.W D1,D0 / ROR.L #8,D0   ; build 0xFs000000
$01B2  OR.L  D0,D4                          ; + the operand-aperture offset
$01B4  MOVE.L D4,$0(A2,D1.W*4)              ; store per-slot aperture base

The result: QCOD+0x68[slot] holds 0xFs0FE000 (or 0xFs3FE000). Alongside it, QCOD+0x00 holds a capability bitmap — one bit per slot that has a 24AC — so that every hook can, in two instructions, decide "is this blit even going to one of my cards?"
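To make the table layout concrete, here is a rough Python model of what the self-init produces. It is a sketch under assumptions: the input dictionary (slot to "large VRAM?" flag) stands in for the Slot Manager walk and the STATUS/CONFIG probe, and I index the capability bitmap by slot number, which is my reading rather than something the listing settles.

```python
SMALL_VRAM_APERTURE = 0x0FE000
LARGE_VRAM_APERTURE = 0x3FE000

def ror32(value: int, count: int) -> int:
    """32-bit rotate right, like the 68K's ROR.L."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF

def build_tables(cards):
    """cards maps slot number -> True for a large-VRAM card (hypothetical input).

    Returns (capability_bitmap, aperture_table); the table has one longword
    per possible slot index, zero where no card was found.
    """
    bitmap = 0
    table = [0] * 16
    for slot in range(0x9, 0xF):           # slots $9..$E
        if slot not in cards:
            continue
        base = ror32(0xF0 | slot, 8)       # same trick as the listing: 0xFs000000
        offset = LARGE_VRAM_APERTURE if cards[slot] else SMALL_VRAM_APERTURE
        table[slot] = base | offset
        bitmap |= 1 << slot                # assumption: bit index == slot number
    return bitmap, table

bitmap, table = build_tables({0x9: False, 0xE: True})
print(hex(table[0x9]), hex(table[0xE]), bin(bitmap))
```

A zero entry doubles as "no card here", which is what the per-blit gate later tests for.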

4. The engine's instruction set

The engine speaks a tiny language of memory writes. There are exactly three verbs: load an operand, fill, and copy. Everything the card accelerates is built from these.

4.1 Load-and-commit an operand

Before a pattern or solid fill, you load the 32-bit operand (the pattern longword or fill colour) through the operand aperture and commit it with a magic write of 4 through the aperture's own +0x400000 window — done twice, presumably for timing/handshake:

write CONTROL($D40403) = $01                 ; fill mode
write LONG  [APER]            = OPERAND       ; the pattern long / fill colour
write LONG  [APER + 0x400000] = 4             ; commit  (driver writes it TWICE)
write LONG  [APER + 0x400000] = 4
; thereafter reading [APER] returns OPERAND — the driver reads it back for edges

The driver keeps a one-entry cache: CMP.L (APER),newOperand / BEQ skip. If the pattern longword hasn't changed since the last fill, it skips the reload entirely. Small, but it matters — most fills in a row use the same colour.
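The load-and-commit sequence plus its cache can be sketched as a small write logger. This is an illustration only: the class and its fields are hypothetical, the `latched` attribute stands in for reading [APER] back from the card, and CONTROL is logged by name because its address depends on the slot.

```python
COMMIT_ALIAS = 0x400000   # the operand aperture's own +0x400000 window
CONTROL_FILL = 0x01

class OperandLoader:
    """Logs the stores a load-and-commit would issue (hypothetical model)."""

    def __init__(self, aperture: int):
        self.aperture = aperture
        self.latched = None    # stands in for the value read back from [APER]
        self.writes = []       # (address or register name, value)

    def load(self, operand: int) -> bool:
        """Load an operand; return False if the one-entry cache hit."""
        if self.latched == operand:
            return False                               # CMP.L / BEQ skip
        self.writes.append(("CONTROL", CONTROL_FILL))  # fill mode
        self.writes.append((self.aperture, operand))   # the pattern long
        commit = self.aperture + COMMIT_ALIAS
        self.writes.append((commit, 4))                # commit ...
        self.writes.append((commit, 4))                # ... written twice
        self.latched = operand
        return True

loader = OperandLoader(0xF90FE000)
loader.load(0xFFFFFFFF)   # four stores
loader.load(0xFFFFFFFF)   # cache hit, no stores
print(len(loader.writes))
```

Two fills in the same colour therefore cost four stores in total, not eight.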

4.2 Fill — run-length writes through the active bank

With an operand latched, the body of each scanline is filled by writing run lengths into the active bank:

A = destPixelAddr + 0x400000                 ; active-bank alias of the dest position
loop over the middle of the scanline:
    L = min(bytesRemaining, stripeWidth)     ; clamp to the engine's stripe limit
    write LONG [A] = $40000000 | L           ; engine fills L bytes here with the operand
    A += L
    bytesRemaining -= L

Note the value written: $40000000 | L. This is a subtle and load-bearing detail. Bit 30 is a command/enable flag that sits above the run-length counter; the actual byte count is the low bits. Take the whole longword as the count and the fill runs to the end of VRAM: a spectacular full-screen smear (and, because a pull-down menu's fills are narrow and bottom-up, exactly the kind of mistake that surfaces as white-and-black bands across the desktop rather than an obvious crash). The count is the written value & ~$40000000, nothing more.
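The stripe loop and the flag/count split can be written out as a short Python sketch. The stripe width of 512 bytes used in the example is a placeholder of mine; the real limit is whatever the driver's geometry constants say, which I have not pinned down here.

```python
FILL_FLAG = 0x40000000    # bit 30: command/enable flag
ACTIVE_BANK = 0x400000    # offset of the transforming alias

def fill_writes(dest: int, byte_count: int, stripe_width: int):
    """Return the (address, value) longword stores for one scanline middle."""
    writes = []
    addr = dest + ACTIVE_BANK
    remaining = byte_count
    while remaining > 0:
        run = min(remaining, stripe_width)      # clamp to the stripe limit
        writes.append((addr, FILL_FLAG | run))  # flag above, count below
        addr += run
        remaining -= run
    return writes

def run_length(value: int) -> int:
    """Decode the byte count the way an emulator should: strip bit 30."""
    return value & ~FILL_FLAG

# A 640-byte run with a hypothetical 512-byte stripe limit.
for addr, value in fill_writes(0x1000, 640, 512):
    print(hex(addr), hex(value), run_length(value))
```

Decoding with `run_length` rather than using the raw longword is exactly the difference between a correct fill and the full-screen smear described above.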

4.3 Copy — a two-write handshake with an auto-incrementing source

The copy path (CONTROL=$7F, used by ScrollRect and window CopyBits) is the most interesting, and the place where a naïve model goes wrong twice.

It is not "stream the source pixels through the window." It is a two-write handshake per stripe:

write CONTROL($D40403) = $7F
for each scanline stripe:
    write LONG [srcPos + 0x400000] = L                  ; no flag → LATCH source = srcPos, len = L
    write LONG [dstPos + 0x400000] = $40000000 | L      ; flag   → COPY L bytes srcPos → dstPos

The flagless write latches a source position and length (it stores nothing); the bit-30-flagged write executes the copy from the latched source to this destination — a source-active store followed by a dest-active store:

MOVE.L  D4,(A4)          ; write to src+0x400000, no flag  → latch source
MOVE.L  D5,(A5)          ; write to dst+0x400000, flag set → copy

And there's a second subtlety worth pinning down: the latched source is an auto-incrementing pointer. After each execute it advances by the number of bytes copied, so a single latch can feed several executes. The driver exploits this whenever a copy's destination would straddle an 8 KB VRAM boundary (dst & 0x1FFF): it splits the copy at the boundary into two executes from one latch, supplying only the second destination and letting the engine continue the source itself:

LATCH src=$a788  len=512
COPY  dst=$7f88  len=120           ; 0x7f88 + 120 = 0x8000  (crosses the 8 KB line)
COPY  dst=$8000  len=392           ; SAME latch — source auto-advances 0xa788 → 0xa800

Treat the source as pinned to the latched value and the second execute starts again from the latched address, re-reading the first 120 bytes of the source instead of continuing from byte 120: you get a faint, regular dotting along every boundary-crossing scanline, invisible on a page-jump scroll (whose copies rarely straddle a boundary) but glaringly obvious when a text window is dragged down one line at a time. The source pointer must advance by the number of bytes copied after every execute.
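Here is a toy Python model of the handshake, enough to reproduce the boundary split in the trace above. It is a sketch under assumptions: the class is hypothetical, it ignores the length carried by the latch write, and the driver-side loop splits at every 8 KB line, whereas I have only seen the driver split a stripe once.

```python
FLAG = 0x40000000
ACTIVE_BANK = 0x400000
BOUNDARY = 0x2000          # 8 KB

class CopyEngine:
    """Toy model of the $7F copy mode (hypothetical, not the real silicon)."""

    def __init__(self, vram_size: int):
        self.vram = bytearray(vram_size)
        self.src = 0       # latched, auto-incrementing source pointer

    def write_active(self, active_addr: int, value: int) -> None:
        pos = active_addr - ACTIVE_BANK
        if value & FLAG:                       # flagged write: execute the copy
            n = value & ~FLAG
            self.vram[pos:pos + n] = self.vram[self.src:self.src + n]
            self.src += n                      # the source advances by itself
        else:                                  # flagless write: latch only
            self.src = pos

def copy_stripe(engine: CopyEngine, src: int, dst: int, length: int):
    """Driver side: one latch, then executes split at 8 KB destination lines."""
    executes = []
    engine.write_active(src + ACTIVE_BANK, length)
    while length > 0:
        room = BOUNDARY - (dst & (BOUNDARY - 1))
        n = min(length, room)
        engine.write_active(dst + ACTIVE_BANK, FLAG | n)
        executes.append((dst, n))
        dst += n
        length -= n
    return executes

engine = CopyEngine(0x10000)
engine.vram[0xA788:0xA788 + 512] = bytes(i % 251 for i in range(512))
print([(hex(d), n) for d, n in copy_stripe(engine, 0xA788, 0x7F88, 512)])
```

Removing the `self.src += n` line is the quickest way to see the dotting artefact: the second execute then copies the start of the source again.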

4.4 Stretch

The SetUpStretch hook latches CONTROL=$03 once at the start of a scaled blit; the per-output-row work is then driven by a software digital differential analyser (stScanLoop) that calls the fill/copy verbs above for each row. No new engine registers — stretch is just fill/copy under a scaling loop.

That's the entire hardware interface: a mode latch, an operand register, and two transforming write verbs (fill and copy), each addressed through a +0x400000 window. Now let's see how the QuickDraw hooks use it.

5. The patches, routine by routine

The thirteen hooks split cleanly across the two code resources by strategy:

Resource	Hooks	Strategy
`QDPA` 0/1/2	`bMAIN0`, `bEND0`, `rMASK0` (1-bit), `stScanLoop`	pure software — tighter 68K loops, no card registers touched
`QCOD` 0	`bSETUP8`, `bXMAIN8`, `bSetup0`, `bLeft0`, `rMASK8`, `rXMASK8`, `slXMASK8`, `FastSlabMode`, `SetUpStretch`, (`rMASK0` hw)	hardware-assisted — each programs `$D40403` and streams through `+0x400000`

The split is mechanical: QDPA 0/1/2 contain zero card-register accesses; every QCOD body reads $D00402 and writes $D40403.

Every QCOD (hardware) body has the same skeleton:

1. ELIGIBILITY GATE      → on ANY failure: BRA reject → chain to the routine we displaced
     • destination is one of our cards      (top byte of dest vs the capability bitmap)
     • width D2 (LONGCNT) ≥ 0x20            (≥ 32 longs — big enough to be worth offloading)
     • MMU32bit ($0CB2) set                 (32-bit mode — aperture reachable)
     • per-slot aperture base ≠ 0           (card actually present)
     • dest row-bump longword-aligned       (A3 & 3 == 0)
2. DERIVE slot base = destBase & 0xFF000000 ; index the QCOD+0x68 table
3. PROGRAM the engine: read STATUS → stride; write CONTROL; load the operand
4. STREAM the mask-free middle through +0x400000; do masked EDGES in software
5. advance per scanline (via the BitBlt A6 frame vars), loop, RTS

The QDPA (software) bodies skip steps 2–4 and just run a tight CPU loop. Let's look at both kinds.
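The gate in step 1 is just a chain of cheap tests, and it reads naturally as a predicate. A minimal Python sketch, assuming the capability bitmap is indexed by slot number and that the per-slot table has one entry per slot index; the function and parameter names are mine:

```python
MIN_LONGS = 0x20   # blits narrower than 32 longs are not worth offloading

def eligible(dest_addr, long_count, mmu_32bit,
             capability_bitmap, aperture_table, row_bump):
    """Return True if a blit would take the hardware path (sketch only)."""
    slot = (dest_addr >> 24) - 0xF0            # top byte 0xFs -> slot s
    if not 0x9 <= slot <= 0xE:
        return False                           # destination is not in slot space
    if not capability_bitmap & (1 << slot):
        return False                           # no 24AC in that slot
    if long_count < MIN_LONGS:
        return False                           # too small to pay for setup
    if not mmu_32bit:
        return False                           # aperture unreachable
    if aperture_table[slot] == 0:
        return False                           # card not actually present
    if row_bump & 3:
        return False                           # row bump not longword-aligned
    return True

table = [0] * 16
table[0x9] = 0xF90FE000
print(eligible(0xF9001000, 0x20, True, 1 << 0x9, table, 1024))  # True
print(eligible(0xF9001000, 0x1F, True, 1 << 0x9, table, 1024))  # False
```

Every False here corresponds to the BRA reject path: the hook chains to whatever it displaced and the blit proceeds in software.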

5.1 The "acceleration" that touches no hardware: `bMAIN0` / `bEND0`

A small surprise: QuickDraw's generic copy bottleneck — the basic longword loop bMAIN0/bEND0 — is left as a pure-software body in QDPA 2, touching no card registers. That is not because copies can't be accelerated — the $7F mode is a hardware straight-copy, and the card does route the alignment-friendly cases to it (window ScrollRect through the region path, §5.4, and the 68040 fast-copy bSetup0/bLeft0, §5.3). It's that this generic hook must also cover the general case — arbitrary bit-shifted, unaligned, possibly off-card-source copies — which the alignment-restricted hardware path can't take, so it's left as an optimised CPU loop. The card's contribution here is just a better loop.

Compare the ROM's inner loop with the patch. The ROM's basic-copy loop bounces back out to BitBlt's shared NXTSRC between scanlines:

bMAIN0A BFEXTU  (A4){D6:0},D0        ; read 32 shifted source bits
        ADD A0,A4 / EOR.L D7,D0 / AND.L D1,D0 / NOT.L D1
        AND.L (A5),D1 / OR.L D1,D0 / MOVE.L D0,(A5) / ADD A0,A5
        SUB #1,D2 / BEQ bEND0A / BLT NXTSRC   ; NXTSRC lives out in the BitBlt body

The patch internalises the entire scanline loop — inner and outer — so it never pays the per-scanline re-dispatch back into BitBlt:

; --- the patch's OWN per-scanline loop (QDPA 2) ---
$007C  ADDA.W -$25E(A6),A4          ; A4 += SRCBUMP        (frame var)
$0080  ADDA.W A3,A5                 ; A5 += dest row bump
$0082  SUBQ.W #1,-$260(A6)          ; HEIGHT--
$0086  BEQ.S  $0092                 ; done → RTS
$0088  MOVE.L -$26C(A6),D1          ; reload FIRSTMASK
$008C  MOVE.W -$26E(A6),D2          ; reload LONGCNT
$0090  JMP    (A1)                  ; straight back into the main loop — no re-dispatch
$0092  RTS

The only "hardware" involved is that A5 already points at fast card VRAM. The win is real but modest: fewer branches, no per-row trip through the dispatcher. It is emblematic of the whole card's philosophy — the CPU still walks every pixel; you just make it walk more efficiently and, where you can, hand the per-pixel work to silicon.

(Note also what the patch isn't: it's a plain 68020-class BFEXTU/MOVE.L loop, not a 68040 MOVE16 block copy. The card ships no 68040-tuned software blitter of its own; on a 040, whenever the fast-copy hooks of §5.3 reject a blit, it falls back to the ROM's own MOVE16 path.)

5.2 The hardware fill: `bSETUP8` / `bXMAIN8`

Now the interesting half. bSETUP8 is QuickDraw's patterned fill for 8-bit and deeper pixels. The ROM version expands the 8×8 pattern to 16 longwords and, per destination longword, does a read-modify-write to OR the masked pattern in — pure CPU, one longword at a time.

The patch replaces that RMW loop with hardware fill. Walking the body (QCOD+0x01C2):

; ---- eligibility gate (any fail → $0354 chain-to-ROM) ----
$01C2  MOVE.L -$F4(A6),D0 / ROL.L #8 / SUBI.W #$F8,D0 / BMI $0354   ; dest slot in range?
$01D0  BTST  D0,(QCOD+0) / BEQ $0354                                ; slot in the capability bitmap?
$01D8  CMPI.W #$20,D2 / BLT $0354                                   ; width ≥ 32 longs?
$01E0  TST.B $0CB2 / BEQ $0354                                      ; MMU32bit set?
; ---- derive this slot's aperture base ----
$01EE  MOVE.L A5,D3 / ROL.L #8 / ANDI.L #$FF,D3 / SUBI.W #$F0,D3    ; slot index from dest
$0202  MOVE.L $0(A0,D3.W*4),D3 / MOVE.L D3,(A0) / BEQ $0354          ; base; bail if 0
$0212  MOVE.L A3,D3 / ANDI.W #3,D3 / BNE $0354                       ; dest row-bump long-aligned?
; ---- per pattern longword: load the pattern into the card, then let it fill ----
$027C  MOVE.L (A4)+,D6                    ; next pattern long
$027E  CMP.L  (A6),D6 / BEQ $02AA         ; unchanged? skip reprogramming (1-entry cache)
$028C  MOVE.B #$1,$D40403(,D1.L*1)        ; latch CONTROL = fill
$0296  MOVE.L D6,(A6)                     ; write pattern long to the operand aperture
$0298  ADDA.L #$400000,A6                 ; → the commit window
$029E  MOVEQ #4,D1 / MOVE.L D1,(A6) / MOVE.L D1,(A6)   ; commit (twice)
; ---- left edge in software (masked), then the middle through the accel aperture ----
$02AA  MOVE.L D6,D0 / AND.L D4,D0 / AND.L (A5),D1 / OR.L D1,D0 / MOVE.L D0,(A5)+  ; masked edge long
$02D2  ADDA.L #$400000,A5                 ; switch dest to the active bank
       ... stream the scanline middle — the card replicates the loaded pattern ...

The shape is exactly the §4 skeleton: latch fill mode, load the pattern, commit, then write the mask-free middle of the scanline through dest + 0x400000 where silicon replicates the pattern into VRAM. The ragged left/right edges — the partial longwords that need per-pixel masking — are still done by the CPU through the passive bank, because the engine fills whole units, not masked fractions.
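The software edge at $02AA is the classic masked merge. A minimal Python sketch, assuming (as the listing suggests but does not show) that D4 holds the edge mask and D1 its complement when the sequence runs; the function name is mine:

```python
MASK32 = 0xFFFFFFFF

def masked_edge(dest_long: int, pattern_long: int, mask: int) -> int:
    """Merge pattern bits into a destination longword under an edge mask.

    Bits set in `mask` come from the pattern; the rest keep the old pixels.
    """
    keep = dest_long & ~mask & MASK32    # AND.L (A5),D1 with D1 = ~mask
    paint = pattern_long & mask          # AND.L D4,D0
    return paint | keep                  # OR.L D1,D0

# Only the low two bytes of this longword fall inside the fill.
print(hex(masked_edge(0x11223344, 0xAABBCCDD, 0x0000FFFF)))  # 0x1122ccdd
```

This read-modify-write is what the ROM does for every longword of the scanline; the patch only pays it for the two fringe longwords.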

bXMAIN8 is the same skeleton for the XOR / transfer-mode case. It computes a raster-op code from the QuickDraw mode word and writes that to $D40403 instead of $01:

$107A  MOVE.B $D00402(,D3.L*1),D3 / ANDI.W #7,D3 / MOVE.W tbl(PC,D3*2),D3   ; STATUS → stride
$10AA  MOVE.B D0,$D40403(,D3.L*1)            ; write the COMPUTED transfer mode
$10F4  ADDA.L #$400000,A0                    ; second aperture
$10FE  MOVE.L $0(A4,D3.W*1),(A2)+ / DBF D1,… ; stream the run

The code is derived from the QuickDraw mode word D7 by if (D7 & 4) code = D7>>2; else code = ((D7<<1)+4)>>2 — i.e. QuickDraw's transfer mode is mapped onto the engine's raster-op selector.
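For reference, here is that mapping transcribed literally into Python. It is a direct restatement of the expression above and nothing more: I have not established how wide the mode word in D7 is at this point or what each resulting selector does in the engine, so treat the printed values as raw output of the formula.

```python
def engine_rop(mode_word: int) -> int:
    """Literal transcription of the D7 -> CONTROL mapping described above."""
    if mode_word & 4:
        return mode_word >> 2
    return ((mode_word << 1) + 4) >> 2

# Tabulate the formula for a few small mode words.
for mode in (0, 4, 8, 12):
    print(mode, "->", engine_rop(mode))
```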

5.3 The fast block copy: `bSetup0` / `bLeft0`

These are the hooks that displace the ROM's 68040 MOVE16 fast copy. The patch carries the same eligibility gate, then puts the card into $7F fast-block mode and runs the aligned transfer through the active bank:

$05B6  MOVE.B #$7F,$D40403(,D1.L*1)         ; latch fast-block copy
$0612  ADDA.L #$400000,A5                    ; stream through the active aperture

Anything not aligned, not 32+ longs wide, or not headed to a card falls through to the ROM's own 040 path.

5.4 The region-masked fills: `rMASK0` / `rMASK8` / `rXMASK8` — and a clever double-install

These come from RgnBlt: blits clipped to an arbitrary region, where each destination scanline is described by a run-encoded instruction stream (skip / count / mask-flag words). The patches keep the ROM's run-decode logic verbatim and swap only the inner fill: a solid masked run streams through the active bank (hardware fill); a fragmented run falls back to software.

rMASK0 (the 1-bit case) is the one genuinely subtle hook, because it is installed twice, and the order matters:

First, the software pass patches the QDPA 1 body (a self-contained region-masked 1-bit fill, no card access) directly onto trap $AB5A, saving no chain-back pointer — it's terminal.
Then the hardware pass patches the QCOD rMASK0 body on top, and does capture the displaced handler first — which is now the QDPA 1 software body.

; software pass (installer):
$021E  _GetResource('QDPA',1) / _DetachResource / _HLock / _StripAddress → A0
$0242  MOVE.W #$AB5A,D0 / _SetToolTrapAddress        ; rMASK0 ← QDPA 1 (software)
; hardware pass then layers QCOD's rMASK0 on top, saving QDPA 1 as its fallback.

The upshot: rMASK0 is the only hook that degrades hardware → optimised software (with no ROM tier beneath it). Every other QCOD hook degrades hardware → ROM. It's a small masterpiece of trap-table stacking, and easy to miss.
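The stacking is easier to see in a toy model. In this Python sketch a dictionary stands in for the trap dispatch table and the handlers just return a label; all names are hypothetical, and the `eligible` flag stands in for the whole eligibility gate.

```python
RMASK0 = 0xAB5A

class TrapTable:
    """Stand-in for the Toolbox trap dispatch table."""
    def __init__(self):
        self.entries = {}
    def get(self, trap):
        return self.entries[trap]
    def set(self, trap, handler):
        self.entries[trap] = handler

def rom_rmask0(blit):
    return "rom"

def qdpa1_rmask0(blit):
    return "software"            # terminal: it chains to nothing

def make_qcod_rmask0(fallback):
    def qcod_rmask0(blit):
        if blit["eligible"]:
            return "hardware"
        return fallback(blit)    # reject: chain to the handler we displaced
    return qcod_rmask0

def install(table):
    # Software pass: overwrite the ROM entry, saving no chain-back pointer.
    table.set(RMASK0, qdpa1_rmask0)
    # Hardware pass: capture whatever is installed now (QDPA 1), then layer.
    displaced = table.get(RMASK0)
    table.set(RMASK0, make_qcod_rmask0(displaced))

table = TrapTable()
table.set(RMASK0, rom_rmask0)
install(table)
handler = table.get(RMASK0)
print(handler({"eligible": True}), handler({"eligible": False}))
```

Swap the two passes in `install` and the hardware body would capture the ROM routine instead, leaving the QDPA 1 body on top with no hardware path at all, which is why the order matters.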

5.5 The stretch family: `FastSlabMode`, `SetUpStretch`, `stScanLoop`

Scaling is split three ways:

FastSlabMode (hardware) — picks a solid-fill scanline routine for the line/polygon path and programs the card so the chosen "slab" (a horizontal solid run) is filled in silicon. Solid fills are the ideal hardware case.
SetUpStretch (hardware, System 7.0+ only) — latches CONTROL=$03 once at the start of a scaled blit and sets up scaling state. Its trap-patch table entry lives past the main list's terminator and is reached only through a version-gated secondary dispatch: on System 6 it isn't installed and StretchBits uses the ROM setup (the per-row fills still accelerate).
stScanLoop (pure software) — the per-output-row DDA that steps source and destination by the scaling ratio and calls the horizontal-scale and mode-case routines (which may be the hardware bodies above):

$0004  CMP.W  -$208(A6),D0          ; compare to the denominator
$000E  DIVU.W -$208(A6),D1          ; scaling division
$003C  ADD.L  D0,-$1FC(A6)          ; advance the source address by SRCROW
       ...    add denom to the error term, step source while error ≤ 0, emit one dst row ...

So the stretch path is a software orchestrator fanning out to hardware fills — the CPU walks the scaling geometry, the card does the pixel work.
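For readers who have not met a DDA before, here is a generic vertical-scaling one in Python. To be clear about its status: this is the textbook error-accumulator technique, not a transcription of stScanLoop, whose exact arithmetic (the DIVU, the sign convention of the error term) I have only summarised above.

```python
def stretch_rows(src_rows: int, dst_rows: int):
    """Return, for each destination row, the source row it samples.

    Generic integer DDA: add the numerator to an error term once per
    destination row and step the source each time it overflows the
    denominator. No per-row multiply or divide is needed.
    """
    mapping = []
    src = 0
    error = 0
    for _ in range(dst_rows):
        mapping.append(src)      # emit one destination row from `src`
        error += src_rows
        while error >= dst_rows:
            error -= dst_rows
            src += 1             # advance the source by one row
    return mapping

print(stretch_rows(4, 8))   # doubling: each source row is used twice
print(stretch_rows(8, 4))   # halving: every other source row is skipped
```

In the real hook the "emit one destination row" step is where the horizontal-scale and mode-case routines get called, and those are the parts that may land in hardware.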

6. Why this "limited" design is still fast(er)

It's tempting to be underwhelmed. The 24AC is not an autonomous coprocessor. There is no display list, no command queue, no "draw this polygon" that runs while the CPU does something else. The 68K still executes the entire outer structure of every blit — it walks the scanlines, computes the masks, decodes the region runs, runs the stretch DDA. So where does the speedup come from?

1. The per-pixel inner loop is the expensive part, and that's what moves to silicon. Think about a patterned fill in software. For every destination longword the CPU must: load the pattern, load the destination, mask, combine, store — call it 5–6 instructions per 4 bytes, each store a round trip to VRAM. A hardware fill collapses that to one write per stripe: write [dst+0x400000] = flag|L and the card paints L bytes itself. A 640-byte scanline that was ~160 read-modify-write longword iterations becomes a couple of stripe writes. The CPU's job shrank from O(pixels) to O(stripes).

2. Writes go to fast VRAM through the card's own datapath, not the CPU's. Even the pure-software bMAIN0 copy benefits simply because A5 points at card VRAM and the loop is tighter. When the engine fills, the card's memory controller does the burst into VRAM — the CPU issues one command write and is done, rather than pushing every pixel across the NuBus one longword at a time.

3. The setup cost is amortised, and gated. Programming the engine (latch mode, load pattern, commit) costs a handful of writes. That only pays off if the blit is big enough — which is exactly why the gate rejects anything under 32 longs and sends it to software. The card is smart about when not to bother. A pure-software card can't make that trade because it has nothing to trade against; the 24AC picks the cheaper of two real options per blit.

4. Pattern/colour caching removes redundant reloads. The one-entry operand cache (CMP.L (APER),new / BEQ skip) means a run of same-coloured fills — the common case for desktop and window backgrounds — reloads the engine once, not per fill.
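The arithmetic behind point 1 is worth writing down once. A back-of-envelope Python sketch, counting only CPU-issued operations for the mask-free middle of a scanline; the 512-byte stripe limit is a placeholder of mine, not a measured value, and edge handling and setup writes are deliberately left out.

```python
def software_iterations(bytes_per_row: int) -> int:
    """Read-modify-write loop iterations: one per destination longword."""
    return bytes_per_row // 4

def stripe_writes(bytes_per_row: int, stripe_width: int) -> int:
    """Command writes for a hardware fill: one per stripe (ceiling division)."""
    return -(-bytes_per_row // stripe_width)

row = 640
print(software_iterations(row))   # 160 longword iterations in software
print(stripe_writes(row, 512))    # 2 command writes with a 512-byte stripe
```

Each software iteration is itself several instructions, so the real ratio is larger than 160 to 2, but the setup writes from point 3 eat into it on small blits.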

Now the honest bounding. This design cannot help with:

Small blits — the setup cost dominates, so they're rejected to software. Selection XOR, text carets, tiny UI rectangles: all CPU.
Unaligned or bit-shifted copies — the general bMAIN0/bEND0 path stays software (just a tighter loop); only aligned card-to-card copies reach the $7F hardware block-copy (ScrollRect, the 040 fast-copy).
Ragged edges — the partial-longword left/right fringes of every fill are masked in software; only the mask-free interior is hardware. A tall, thin fill is nearly all edge and barely accelerates.
24-bit mode — no aperture access, so no acceleration at all.
Geometry — region decoding, stretch DDA, clipping: all still on the CPU.

7. What the 24AC actually is

Pulling it together:

The Apple Macintosh Display Card 24AC is a smart framebuffer: a 24-bit colour NuBus card whose VRAM is mapped twice — once as plain pixels, once as a hardware-transforming alias — with a three-register engine (mode latch, operand, transforming write window) that fills and copies runs of pixels on command. Its control panel patches thirteen of QuickDraw's dispatchable bottleneck hooks: the colour fill/pattern/XOR/region/stretch hooks drive the engine; the basic copy and the stretch DDA are hand-optimised software; and every accelerated path is gated and degrades transparently to the code it replaced. It is not a coprocessor — the CPU drives every scanline — but by moving the per-pixel raster-op and pattern replication into silicon behind a memory-mapped window, it wins most of the available speedup at a fraction of a coprocessor's cost, and it can never draw a wrong pixel.

Appendix A — hook → trap → body map

Hook (trap)	Body	Resource	Hardware?	Degrades to
`bMAIN0` ($AB30)	`QDPA 2` +6	QDPA	no (SW)	terminal
`bEND0` ($AB40)	`QDPA 2` +0	QDPA	no (SW)	terminal
`stScanLoop` ($AB99)	`QDPA 0` +0	QDPA	no (SW)	terminal
`bSETUP8` ($AB34)	`QCOD 0` +$01C2	QCOD	yes	ROM
`bXMAIN8` ($AB38)	`QCOD 0` +$1010	QCOD	yes	ROM
`bSetup0` ($AB58)	`QCOD 0` +$052A	QCOD	yes	ROM (040 `MOVE16`)
`bLeft0` ($AB59)	`QCOD 0` +$06B0	QCOD	yes	ROM
`rMASK0` ($AB5A)	`QCOD 0` +$084C over `QDPA 1`	QCOD→QDPA	yes	software (QDPA 1)
`rMASK8` ($AB5E)	`QCOD 0` +$035A	QCOD	yes	ROM
`rXMASK8` ($AB62)	`QCOD 0` +$0BAA	QCOD	yes	ROM
`slXMASK8` ($AB84)	`QCOD 0` +$0DEA	QCOD	yes	ROM
`FastSlabMode` ($AB0C)	`QCOD 0` +$11F4	QCOD	yes	ROM
`SetUpStretch` ($AB24)	`QCOD 0` +$1432	QCOD	yes (Sys 7.0+)	ROM

Appendix B — engine quick reference

Capability bitmap:        QCOD+0x00              (BTST slot-index)
Per-slot aperture table:  QCOD+0x68[slot]        = 0xFs0FE000 | 0xFs3FE000
STATUS  (read):           slotBase + 0xD00402    ([2:0] depth → stride table; [3] class)
CONFIG  (read, init):     slotBase + 0xD40402    ([0] geometry variant)
CONTROL (write):          slotBase + 0xD40403    (01=fill  03=stretch  7F=copy  00..3F=ROP)
Passive framebuffer:      slotBase + 0x000000    (plain pixels; QuickDraw baseAddr)
Active (engine) bank:     dest     + 0x400000    (writes are transformed by CONTROL)
Operand aperture:         slotBase + 0x0FE000|0x3FE000   (load operand; +0x400000 write 4 = commit)
32-bit-mode gate:         lowmem MMU32bit ($0CB2) must be non-zero

Fill   : write [dst+0x400000] = 0x40000000 | L         ; fills L bytes with the operand
Copy   : write [src+0x400000] = L                      ; latch source (no store)
         write [dst+0x400000] = 0x40000000 | L         ; copy L bytes; source auto-advances by L
