Optimizing assembly code
Sometimes the simplest way to write something in assembly code isn't the best. All of your resources are limited: CPU speed, ROM size, RAM space, register use. You can rewrite code to use those resources more efficiently (sometimes by trading one for another).
Most of these tricks come from Jeff's GB Assembly Code Tips v1.0, WikiTI's Z80 Optimization page, z80 Heaven's optimization tutorial, and GBDev Wiki's ASM Snippets. (Note that Z80 assembly is not the same as GBZ80; it has more registers and some different instructions.)
WikiTI's advice fully applies here:
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea. You can easily forget how a trick works after a while away from that part of the code.
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions on their use; comments warn about them. Some tricks apply to other cases, but again you have to be careful.
There are some tricks that are nothing more than the correct use of the available instructions on the Z80. Keeping an instruction set summary helps to visualize what you can do during coding.
(There's also a "cheat sheet" table of instructions summarizing their bytes, cycles, and affected flags, if you don't need a long listing of what each one does.)
Contents
- 8-bit registers
  - Set a to 0
  - Increment or decrement a
  - Invert the bits of a
  - Rotate the bits of a
  - Reverse the bits of a
  - Set a to some constant minus a
  - Set a to one constant or another depending on the carry flag
  - Increment or decrement a when the carry flag is set
  - Divide a by 8 (shift a right 3 bits)
  - Divide a by 16 (shift a right 4 bits)
  - Set a to some value plus or minus carry
  - Add or subtract the carry flag from a register besides a
  - Load from HRAM to a or from a to HRAM
- 16-bit registers
  - Multiply hl by 2
  - Add a to a 16-bit register
  - Subtract an 8-bit constant from a 16-bit register
  - Set a 16-bit register to a plus a constant
  - Set a 16-bit register to a * 16
  - Increment or decrement a 16-bit register
  - Add or subtract the carry flag from a 16-bit register
  - Load from an address to hl
  - Exchange two 16-bit registers
  - Subtract two 16-bit registers
  - Load two constants into a register pair
  - Load a constant into [hl]
  - Increment or decrement [hl]
  - Load a constant into [hl] and increment or decrement hl
- Branching (control flow)
- Subroutines (functions)
- Jump and lookup tables
8-bit registers
Set a to 0
Don't do:
ld a, 0 ; 2 bytes, 2 cycles, no changes to flags
But do:
xor a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1
Or do:
sub a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1
Don't use the optimized versions if you need to preserve flags. As such, ld a, 0 must be left intact in the code below:
ld a, [wIsTrainerBattle]
and a ; sets zero flag if [wIsTrainerBattle] == 0
ld a, 0 ; sets a to 0 without affecting zero flag
jr nz, .is_trainer_battle
; is not trainer battle
Increment or decrement a
When possible, avoid doing:
add 1 ; 2 bytes, 2 cycles; sets carry for -1 to 0 overflow
sub 1 ; 2 bytes, 2 cycles; sets carry for 0 to -1 underflow
If you don't need to set the carry flag, then do:
inc a ; 1 byte, 1 cycle
dec a ; 1 byte, 1 cycle
Invert the bits of a
Don't do:
xor $ff ; 2 bytes, 2 cycles
But do:
cpl ; 1 byte, 1 cycle
Rotate the bits of a
Don't do:
rl a ; 2 bytes, 2 cycles; updates Z and C flags
rlc a ; 2 bytes, 2 cycles; updates Z and C flags
rr a ; 2 bytes, 2 cycles; updates Z and C flags
rrc a ; 2 bytes, 2 cycles; updates Z and C flags
But do:
rla ; 1 byte, 1 cycle; updates C flag
rlca ; 1 byte, 1 cycle; updates C flag
rra ; 1 byte, 1 cycle; updates C flag
rrca ; 1 byte, 1 cycle; updates C flag
The exception is if you need to set the zero flag when the operation results in 0 for a; the two-byte operations can set z, the one-byte operations cannot.
Reverse the bits of a
(This optimization is based on Retro Programming).
(The example uses b and c, but any of d, e, h, or l would also work.)
Don't do:
; 25 bytes, 25 cycles
rept 8
rra ; nor rla
rl b ; nor rr b
endr
ld a, b
And don't do:
; 17 bytes, 17 cycles
ld b, a
rlca
rlca
xor b
and $aa
xor b
ld b, a
rlca
rlca
rlca
rrc b
xor b
and $66
xor b
But do:
; 15 bytes, 15 cycles
ld b, a
rlca
rlca
xor b
and $aa
xor b
ld b, a
swap b
xor b
and $33
xor b
rrca
Or if you really want to optimize for size over speed, then don't do:
; 10 bytes, 59 cycles
ld bc, 8 ; lb bc, 0, 8
.loop
rra ; nor rla
rl b ; nor rr b
dec c
jr nz, .loop
ld a, b
But do:
; 8 bytes, 50 cycles
ld b, 1
.loop
rra
rl b
jr nc, .loop
ld a, b
Or if you really want to optimize for speed over size, then do:
; 6 bytes, 12 cycles
; (4 bytes, 5 cycles if you don't need the push hl/pop hl)
push hl
ld h, HIGH(ReversedBitTable)
ld l, a
ld a, [hl]
pop hl
; 256 bytes; placed in ROM0 or the same ROMX section as the bit reversal
SECTION "ReversedBitTable", ROM0, ALIGN[8]
ReversedBitTable::
x = 0
rept 256
; http://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith32Bits
db LOW(((((x * $802) & $22110) | ((x * $8020) & $88440)) * $10101) >> 16)
x = x + 1
endr
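The 15-byte sequence and the table expression above can be checked against each other off-target. The following is a minimal Python sketch, not part of the original page: the helper names are hypothetical, and it models only the value in each register, not the flags.

```python
def rlca(a):
    """Rotate an 8-bit value left by one bit (carry flag not modeled)."""
    return ((a << 1) | (a >> 7)) & 0xFF

def rrca(a):
    """Rotate an 8-bit value right by one bit (carry flag not modeled)."""
    return ((a >> 1) | (a << 7)) & 0xFF

def swap(x):
    """Exchange the high and low nibbles, like the swap instruction."""
    return ((x << 4) | (x >> 4)) & 0xFF

def reverse_15_byte(a):
    """Mirror of the 15-byte sequence, one statement per instruction."""
    b = a              # ld b, a
    a = rlca(rlca(a))  # rlca / rlca
    a ^= b             # xor b
    a &= 0xAA          # and $aa
    a ^= b             # xor b
    b = a              # ld b, a
    b = swap(b)        # swap b
    a ^= b             # xor b
    a &= 0x33          # and $33
    a ^= b             # xor b
    return rrca(a)     # rrca

def table_entry(x):
    """The expression used to fill ReversedBitTable, truncated like LOW()."""
    mixed = ((x * 0x802) & 0x22110) | ((x * 0x8020) & 0x88440)
    return ((mixed * 0x10101) >> 16) & 0xFF

def reference(x):
    """Plain bit reversal by reversing the 8-character binary string."""
    return int(f"{x:08b}"[::-1], 2)

# Collect every input where the three methods disagree.
mismatches = [x for x in range(256)
              if not (reverse_15_byte(x) == table_entry(x) == reference(x))]
```

If `mismatches` stays empty, the instruction sequence and the lookup table agree for every byte value, so either can be swapped for the other based purely on the size and speed trade-off.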
Set a to some constant minus a
Don't do:
; 4 bytes, 4 cycles
ld b, a
ld a, FOOBAR
sub b
But do:
; 3 bytes, 3 cycles
cpl
add FOOBAR + 1
(FOOBAR is a placeholder name standing in for any constant.)
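The trick works because cpl turns a into $ff - a, which is -a - 1 modulo 256, so adding FOOBAR + 1 leaves FOOBAR - a. A minimal Python sketch of that identity follows; the function names are hypothetical, and only the resulting value of a is compared, since the flags after the two versions are not claimed to match.

```python
def const_minus_a_slow(a, foobar):
    """ld b, a / ld a, FOOBAR / sub b"""
    return (foobar - a) & 0xFF

def const_minus_a_fast(a, foobar):
    """cpl / add FOOBAR + 1"""
    a ^= 0xFF                      # cpl: a becomes $ff - a
    return (a + foobar + 1) & 0xFF  # add FOOBAR + 1

# Check every combination of a and an 8-bit constant.
mismatches = [(a, n) for a in range(256) for n in range(256)
              if const_minus_a_slow(a, n) != const_minus_a_fast(a, n)]
```

One edge worth keeping in mind: when FOOBAR is $ff, FOOBAR + 1 no longer fits in a byte, and cpl on its own already gives $ff - a.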
Set a to one constant or another depending on the carry flag
(The example sets a to CVAL if the carry flag is set (c), or NCVAL if the carry flag is not set (nc).)
Don't do:
; 6 bytes, 5 or 6 cycles
ld a, CVAL
jr c, .carry
ld a, NCVAL
.carry
And don't do:
; 6 bytes, 5 or 6 cycles
ld a, NCVAL
jr nc, .no_carry
ld a, CVAL
.no_carry
And if either is 0, don't do:
; 5 bytes, 5 cycles
ld a, CVAL ; nor NCVAL
jr c, .carry ; nor jr nc
xor a
.carry
And if either is 1 more or less than the other, don't do:
; 5 bytes, 5 cycles
ld a, CVAL ; nor NCVAL
jr c, .carry ; nor jr nc
inc a ; nor dec a
.carry
Instead use sbc a, which copies the carry flag to all bits of a. Thus do:
; 5 bytes, 5 cycles
sbc a ; if carry, then $ff, else 0
and CVAL - NCVAL ; $ff becomes CVAL - NCVAL, 0 stays 0
add NCVAL ; CVAL - NCVAL becomes CVAL, 0 becomes NCVAL
Or do:
; 5 bytes, 5 cycles
sbc a ; if carry, then $ff, else 0
and CVAL ^ NCVAL ; $ff becomes CVAL ^ NCVAL, 0 stays 0
xor NCVAL ; CVAL ^ NCVAL becomes CVAL, 0 becomes NCVAL
And if certain conditions apply, then do something more efficient:
If CVAL == $ff and NCVAL == 0, then do:
; 1 byte, 1 cycle
sbc a ; if carry, then $ff, else 0
If CVAL == 0 and NCVAL == $ff, then do:
; 2 bytes, 2 cycles
ccf ; invert carry flag
sbc a ; if originally carry, then 0, else $ff
If CVAL == 0 and NCVAL == 1, then do:
; 2 bytes, 2 cycles
sbc a ; if carry, then $ff aka -1, else 0
inc a ; -1 becomes 0, 0 becomes 1
If CVAL == $ff, then do:
; 3 bytes, 3 cycles
sbc a ; if carry, then $ff, else 0
or NCVAL ; $ff stays $ff, $00 becomes NCVAL
If NCVAL == 0, then do:
; 3 bytes, 3 cycles
sbc a ; if carry, then $ff, else 0
and CVAL ; $ff becomes CVAL, 0 stays 0
If CVAL == NCVAL - 1, then do:
; 3 bytes, 3 cycles
sbc a ; if carry, then $ff aka -1, else 0
add NCVAL ; -1 becomes NCVAL - 1 aka CVAL, 0 becomes NCVAL
If CVAL == NCVAL - 2, then do:
; 3 bytes, 3 cycles
sbc a ; if carry, then $ff aka -1, else 0; doesn't change the carry flag
sbc -NCVAL ; -1 becomes NCVAL - 2 aka CVAL, 0 becomes NCVAL
If CVAL == 0, then do:
; 4 bytes, 4 cycles
ccf ; invert carry flag
sbc a ; if originally carry, then 0, else $ff
and NCVAL ; 0 stays 0, $ff becomes NCVAL
If NCVAL == $ff, then do:
; 4 bytes, 4 cycles
ccf ; invert carry flag
sbc a ; if originally carry, then 0, else $ff
or CVAL ; $00 becomes CVAL, $ff stays $ff
If NCVAL == CVAL - 1, then do:
; 4 bytes, 4 cycles
ccf ; invert carry flag
sbc a ; if originally carry, then 0, else $ff aka -1
add CVAL ; -1 becomes CVAL - 1 aka NCVAL, 0 becomes CVAL
If NCVAL == CVAL - 2, then do:
; 4 bytes, 4 cycles
ccf ; invert carry flag
sbc a ; if originally carry, then 0, else $ff aka -1; doesn't change the carry flag
sbc -CVAL ; -1 becomes CVAL - 2 aka NCVAL, 0 becomes CVAL
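The two general forms (sbc a / and / add, and sbc a / and / xor) can be checked for every pair of constants. This is a minimal Python sketch under the assumption that only the final value of a matters; the function names are hypothetical, and sbc a is modeled directly as "$ff if carry was set, else 0".

```python
def sbc_a(carry):
    """sbc a computes a - a - carry: $ff if carry was set, else 0."""
    return 0xFF if carry else 0x00

def select_with_add(carry, cval, ncval):
    a = sbc_a(carry)            # sbc a
    a &= (cval - ncval) & 0xFF  # and CVAL - NCVAL
    return (a + ncval) & 0xFF   # add NCVAL

def select_with_xor(carry, cval, ncval):
    a = sbc_a(carry)            # sbc a
    a &= cval ^ ncval           # and CVAL ^ NCVAL
    return a ^ ncval            # xor NCVAL

# Both forms should pick CVAL when carry is set and NCVAL otherwise.
mismatches = []
for carry in (0, 1):
    for cval in range(256):
        for ncval in range(256):
            expected = cval if carry else ncval
            if (select_with_add(carry, cval, ncval) != expected
                    or select_with_xor(carry, cval, ncval) != expected):
                mismatches.append((carry, cval, ncval))
```

The shorter special cases fall out of the same model: whenever the mask or the final operand is $ff or 0, the corresponding instruction does nothing and can be dropped.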
Increment or decrement a when the carry flag is set
Don't do:
; 3 bytes, 3 cycles
jr nc, .ok
inc a
.ok
; 3 bytes, 3 cycles
jr nc, .ok
dec a
.ok
But do:
adc 0 ; 2 bytes, 2 cycles
sbc 0 ; 2 bytes, 2 cycles
Divide a by 8 (shift a right 3 bits)
Don't do:
; 6 bytes, 9 cycles
; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide)
ld c, 8 ; divisor
call SimpleDivide
ld a, b ; quotient
And don't do:
; 6 bytes, 6 cycles
srl a
srl a
srl a
But do:
; 5 bytes, 5 cycles
rrca
rrca
rrca
and %00011111
Divide a by 16 (shift a right 4 bits)
Don't do:
; 6 bytes, 9 cycles
; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide)
ld c, 16 ; divisor
call SimpleDivide
ld a, b ; quotient
And don't do:
; 8 bytes, 8 cycles
srl a
srl a
srl a
srl a
But do:
; 4 bytes, 4 cycles
swap a
and $f
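Both this trick and the divide-by-8 one before it replace shifts with rotations followed by a mask that discards the bits that wrapped around. A minimal Python sketch of why that is safe, with hypothetical helper names and no flag modeling:

```python
def rrca(a):
    """Rotate an 8-bit value right by one bit (carry flag not modeled)."""
    return ((a >> 1) | (a << 7)) & 0xFF

def swap(a):
    """Exchange the high and low nibbles, like the swap instruction."""
    return ((a << 4) | (a >> 4)) & 0xFF

def divide_by_8(a):
    for _ in range(3):     # rrca / rrca / rrca
        a = rrca(a)
    return a & 0b00011111  # and %00011111 drops the three wrapped bits

def divide_by_16(a):
    return swap(a) & 0x0F  # swap a / and $f

# Compare against a plain logical shift for every byte value.
mismatches = [a for a in range(256)
              if divide_by_8(a) != a >> 3 or divide_by_16(a) != a >> 4]
```

An empty `mismatches` list means the rotate-and-mask versions match the srl chains for every input.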
Set a to some value plus or minus carry
(The example uses b and c, but any registers besides a would also work, including [hl].)
Don't do:
; 4 bytes, 4 cycles
ld b, a
ld a, c
adc 0
; 4 bytes, 4 cycles
ld b, a
ld a, c
sbc 0
And don't do:
; 4 bytes, 4 cycles
ld b, a
ld a, 0
adc c
; 4 bytes, 4 cycles
ld b, a
ld a, 0
sbc c
But do:
; 3 bytes, 3 cycles
ld b, a
adc c
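sub b
; 3 bytes, 3 cycles
ld b, a
sbc c
add b
(Check the sbc version before relying on it: sbc c followed by add b leaves a = 2 * a - c - carry (mod 256), which matches neither ld a, c / sbc 0 (c - carry) nor ld a, 0 / sbc c (-c - carry). Ending with sub b instead gives -c - carry, the same value as the ld a, 0 / sbc c form.)
Also, don't do: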
; 5 bytes, 5 cycles
ld b, a
ld a, N
adc 0
; 5 bytes, 5 cycles
ld b, a
ld a, N
sbc 0
And don't do:
; 5 bytes, 5 cycles
ld b, a
ld a, 0
adc N
; 5 bytes, 5 cycles
ld b, a
ld a, 0
sbc N
But do:
; 4 bytes, 4 cycles
ld b, a
adc N
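sub b
; 4 bytes, 4 cycles
ld b, a
sbc N
add b
(Check the sbc version before relying on it: sbc N followed by add b leaves a = 2 * a - N - carry (mod 256). Ending with sub b instead gives -N - carry, matching the ld a, 0 / sbc N form.)
(If the original value of a was not backed up in b, this optimization would not apply.)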
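The adc version works because the backed-up value cancels out: (a + c + carry) - a leaves c + carry. A minimal Python sketch of that cancellation, covering only the adc form and only the value of a (flags are not modeled, and the function names are hypothetical):

```python
def adc(a, operand, carry):
    """Return the 8-bit result of adc operand with the given incoming carry."""
    return (a + operand + carry) & 0xFF

def value_plus_carry_slow(c, carry):
    """ld a, c / adc 0"""
    return adc(c, 0, carry)

def value_plus_carry_fast(a, c, carry):
    """ld b, a / adc c / sub b"""
    b = a                  # ld b, a
    a = adc(a, c, carry)   # adc c
    return (a - b) & 0xFF  # sub b

# The starting value of a should never affect the result.
mismatches = [(a, c, carry)
              for carry in (0, 1) for a in range(256) for c in range(256)
              if value_plus_carry_fast(a, c, carry) != value_plus_carry_slow(c, carry)]
```

The same model applies to the constant form by passing N in place of c.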
Add or subtract the carry flag from a register besides a
(The example uses b, but any of c, d, e, h, or l would also work.)
Don't do:
; 4 bytes, 4 cycles
ld a, b
adc 0
ld b, a
; 4 bytes, 4 cycles
ld a, b
sbc 0
ld b, a
And don't do:
; 4 bytes, 4 cycles
ld a, 0
adc b
ld b, a
; 4 bytes, 4 cycles
ld a, 0
sbc b
ld b, a
But do:
; 3 bytes, 3 cycles
jr nc, .no_carry
inc b
.no_carry
; 3 bytes, 3 cycles
jr nc, .no_carry
dec b
.no_carry
Load from HRAM to a or from a to HRAM
Don't do:
ld a, [hFoobar] ; 3 bytes, 4 cycles
ld [hFoobar], a ; 3 bytes, 4 cycles
But do:
ldh a, [hFoobar] ; 2 bytes, 3 cycles
ldh [hFoobar], a ; 2 bytes, 3 cycles
16-bit registers
Multiply hl by 2
Don't do:
; 4 bytes, 4 cycles
sla l
rl h
But do:
add hl, hl ; 1 byte, 2 cycles
Add a to a 16-bit register
(The example uses hl, but bc or de would also work.)
Don't do:
; 6 bytes, 6 cycles
add l
ld l, a
ld a, 0
adc h
ld h, a
And don't do:
; 6 bytes, 6 cycles
add l
ld l, a
ld a, h
adc 0
ld h, a
And don't do:
; 5 bytes, 5 cycles
add l
ld l, a
jr nc, .no_carry
inc h
.no_carry
But do:
; 5 bytes, 5 cycles, no labels
add l
ld l, a
adc h
sub l
ld h, a
Or if you can spare another 16-bit register and want to optimize for size over speed, do:
; 4 bytes, 5 cycles
ld d, 0
ld e, a
add hl, de
Subtract an 8-bit constant from a 16-bit register
(The example uses hl, but bc or de would also work.)
Don't do:
; 8 bytes, 8 cycles
ld a, l
sub FOOBAR
ld l, a
ld a, h
sbc 0
ld h, a
But do:
; 7 bytes, 7 cycles
ld a, l
sub FOOBAR
ld l, a
jr nc, .no_carry
dec h
.no_carry
(This is a case of "Add or subtract the carry flag from a register besides a", applied to the high part of a 16-bit register.)
Or if you can spare another 16-bit register, do:
; 4 bytes, 5 cycles
ld de, -FOOBAR
add hl, de
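Adding -FOOBAR works because 16-bit addition wraps modulo 65536, so adding the two's complement of a constant is the same as subtracting it. A minimal Python sketch comparing that form with the byte-wise version (hypothetical function names; only the value of hl is modeled, not the flags or the clobbered de):

```python
def sub_const_bytewise(hl, foobar):
    """ld a, l / sub FOOBAR / ld l, a / jr nc, .no_carry / dec h"""
    h, l = hl >> 8, hl & 0xFF
    diff = l - foobar       # sub FOOBAR
    l = diff & 0xFF         # ld l, a
    if diff < 0:            # carry set means the low byte borrowed
        h = (h - 1) & 0xFF  # dec h
    return (h << 8) | l

def sub_const_add_negative(hl, foobar):
    """ld de, -FOOBAR / add hl, de"""
    de = (-foobar) & 0xFFFF    # ld de, -FOOBAR
    return (hl + de) & 0xFFFF  # add hl, de

# A few sample 8-bit constants, checked against every value of hl.
mismatches = [(hl, n) for n in (1, 0x42, 0xFF) for hl in range(0x10000)
              if not (sub_const_bytewise(hl, n)
                      == sub_const_add_negative(hl, n)
                      == (hl - n) & 0xFFFF)]
```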
Set a 16-bit register to a plus a constant
(The example uses hl, but bc or de would also work.)
Don't do:
; 7 bytes, 8 cycles; uses another 16-bit register
ld e, a
ld d, 0
ld hl, FooBar
add hl, de
And don't do:
; 8 bytes, 8 cycles
ld hl, FooBar
add l
ld l, a
adc h
sub l
ld h, a
And don't do:
; 8 bytes, 8 cycles
ld h, HIGH(FooBar)
add LOW(FooBar)
ld l, a
jr nc, .no_carry
inc h
.no_carry
But do:
; 7 bytes, 7 cycles
add LOW(FooBar)
ld l, a
adc HIGH(FooBar)
sub l
ld h, a
And if the constant is 8-bit and nonzero (i.e. 0 < FooBar < 256), then do:
; 6 bytes, 6 cycles
sub LOW(-FooBar)
ld l, a
sbc a
inc a
ld h, a
And if the constant is zero (i.e. FooBar == 0 and a + FooBar == a), then do:
; 3 bytes, 3 cycles
ld l, a
ld h, 0
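The 7-byte and 6-byte sequences above both lean on the carry flag to build the high byte. A minimal Python sketch that steps through each one instruction by instruction, assuming only the final hl matters (the function names are hypothetical):

```python
def a_plus_const(a, foobar):
    """add LOW(FooBar) / ld l, a / adc HIGH(FooBar) / sub l / ld h, a"""
    total = a + (foobar & 0xFF)             # add LOW(FooBar)
    l, carry = total & 0xFF, total >> 8     # ld l, a (carry comes from the add)
    a = (l + (foobar >> 8) + carry) & 0xFF  # adc HIGH(FooBar)
    h = (a - l) & 0xFF                      # sub l / ld h, a
    return (h << 8) | l

def a_plus_small_const(a, foobar):
    """sub LOW(-FooBar) / ld l, a / sbc a / inc a / ld h, a (0 < FooBar < 256)"""
    diff = a - ((-foobar) & 0xFF)          # sub LOW(-FooBar)
    l, carry = diff & 0xFF, int(diff < 0)  # ld l, a (carry set on borrow)
    a = 0xFF if carry else 0x00            # sbc a
    h = (a + 1) & 0xFF                     # inc a / ld h, a
    return (h << 8) | l

# General form: sample 16-bit constants, result wraps modulo 65536.
general_mismatches = [(a, n) for n in (0x0000, 0x00FF, 0x1234, 0xC0FE, 0xFFFF)
                      for a in range(256)
                      if a_plus_const(a, n) != (a + n) & 0xFFFF]

# Small-constant form: every constant from 1 to 255.
small_mismatches = [(a, n) for n in range(1, 256) for a in range(256)
                    if a_plus_small_const(a, n) != a + n]
```

In the small-constant form the borrow from the subtraction is the inverse of the carry an addition would have produced, which is why sbc a followed by inc a yields 0 or 1 for h.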
Set a 16-bit register to a * 16
(The example uses hl, but bc or de would also work.)
You can do:
; 7 bytes, 11 cycles
ld l, a
ld h, 0
add hl, hl
add hl, hl
add hl, hl
add hl, hl
; 7 bytes, 11 cycles
ld l, a
ld h, 0
rept 4
add hl, hl
endr
But if a is definitely small enough, and its value can be changed, then do:
; 7 bytes, 10 cycles; sets a = a * 2; requires a < $80
add a
ld l, a
ld h, 0
add hl, hl
add hl, hl
add hl, hl
; 7 bytes, 9 cycles; sets a = a * 4; requires a < $40
add a
add a
ld l, a
ld h, 0
add hl, hl
add hl, hl
; 7 bytes, 8 cycles; sets a = a * 8; requires a < $20
add a
add a
add a
ld l, a
ld h, 0
add hl, hl
; 5 bytes, 5 cycles; sets a = a * 16; requires a < $10
swap a
ld l, a
ld h, 0
Or if the value of a can be changed and you want to optimize for speed over size, do:
; 8 bytes, 8 cycles; sets a = l
swap a
ld l, a
and $f
ld h, a
xor l
ld l, a
Or do:
; 8 bytes, 8 cycles; sets a = h
swap a
ld h, a
and $f0
ld l, a
xor h
ld h, a
Increment or decrement a 16-bit register
When possible, avoid doing:
inc hl ; 1 byte, 2 cycles
dec hl ; 1 byte, 2 cycles
If the low byte definitely won't overflow, then do:
inc l ; 1 byte, 1 cycle
dec l ; 1 byte, 1 cycle
This is applicable, for instance, if you're reading a data table via hl one byte at a time, it has no more than 256 entries, and it's in its own SECTION which has been ALIGNed to 8 bits. It's unlikely to apply to pokecrystal's existing systems.
Add or subtract the carry flag from a 16-bit register
(The example uses hl, but bc or de would also work.)
Don't do:
; 8 bytes, 8 cycles
ld a, l ; nor ld a, 0
adc 0 ; nor adc l
ld l, a
ld a, h ; nor ld a, 0
adc 0 ; nor adc h
ld h, a
; 8 bytes, 8 cycles
ld a, l ; nor ld a, 0
sbc 0 ; nor sbc l
ld l, a
ld a, h ; nor ld a, 0
sbc 0 ; nor sbc h
ld h, a
And don't do:
; 7 bytes, 7 cycles
ld a, l ; nor ld a, 0
adc 0 ; nor adc l
ld l, a
adc h
sub l
ld h, a
; 7 bytes, 7 cycles
ld a, l ; nor ld a, 0
sbc 0 ; nor sbc l
ld l, a
sbc h
add l
ld h, a
(That would be applying the "Set a to some value plus or minus carry" optimization to part of the first way.)
And don't do:
; 7 bytes, 7 cycles
ld a, l ; nor ld a, 0
adc 0 ; nor adc l
ld l, a
jr nc, .no_carry
inc h
.no_carry
; 7 bytes, 7 cycles
ld a, l ; nor ld a, 0
sbc 0 ; nor sbc l
ld l, a
jr nc, .no_carry
dec h
.no_carry
(That would be applying the "Add or subtract the carry flag from a register besides a" optimization to part of the first way.)
But do:
; 3 bytes, 3 or 4 cycles
jr nc, .no_carry
inc hl
.no_carry
; 3 bytes, 3 or 4 cycles
jr nc, .no_carry
dec hl
.no_carry
Load from an address to hl
Don't do:
; 8 bytes, 10 cycles
ld a, [Address] ; LSB first
ld l, a
ld a, [Address+1]
ld h, a
But do:
; 6 bytes, 8 cycles
ld hl, Address
ld a, [hli]
ld h, [hl]
ld l, a
And don't do:
; 8 bytes, 10 cycles
ld a, [Address] ; MSB first
ld h, a
ld a, [Address+1]
ld l, a
But do:
; 6 bytes, 8 cycles
ld hl, Address
ld a, [hli]
ld l, [hl]
ld h, a
Exchange two 16-bit registers
(The example uses hl and de, but any pair of bc, de, or hl would also work.)
If you care about speed:
; 6 bytes, 6 cycles
ld a, d
ld d, h
ld h, a
ld a, e
ld e, l
ld l, a
If you care about size:
; 4 bytes, 9 cycles
push de
ld d, h
ld e, l
pop hl
Subtract two 16-bit registers
(The example uses hl and de, but any pair of bc, de, or hl would also work.)
Don't do:
; 9 bytes, 10 cycles, modifies subtrahend de
ld a, $ff
xor d
ld d, a
ld a, $ff
xor e
ld e, a
add hl, de
And don't do:
; 7 bytes, 8 cycles, modifies subtrahend de
ld a, d
cpl
ld d, a
ld a, e
cpl
ld e, a
add hl, de
But do:
; 6 bytes, 6 cycles
ld a, l
sub e
ld l, a
ld a, h
sbc d
ld h, a
Load two constants into a register pair
(The example uses bc, but hl or de would also work.)
Don't do:
; 4 bytes, 4 cycles
ld b, FOO
ld c, BAR
But do:
ld bc, FOO << 8 | BAR ; 3 bytes, 3 cycles
Or better, use the lb macro in macros/code.asm:
lb bc, FOO, BAR ; 3 bytes, 3 cycles
Load a constant into [hl]
Don't do:
; 3 bytes, 4 cycles
ld a, FOOBAR
ld [hl], a
But do:
ld [hl], FOOBAR ; 2 bytes, 3 cycles
Increment or decrement [hl]
Don't do:
; 3 bytes, 5 cycles
ld a, [hl]
inc a
ld [hl], a
; 3 bytes, 5 cycles
ld a, [hl]
dec a
ld [hl], a
But do:
inc [hl] ; 1 byte, 3 cycles
dec [hl] ; 1 byte, 3 cycles
Load a constant into [hl] and increment or decrement hl
Don't do:
; 2 bytes, 4 cycles
ld [hl], a
inc hl
; 2 bytes, 4 cycles
ld [hl], a
dec hl
But do:
ld [hli], a ; 1 byte, 2 cycles
ld [hld], a ; 1 byte, 2 cycles
And if you can use a, then don't do:
; 3 bytes, 5 cycles
ld [hl], FOO
inc hl
; 3 bytes, 5 cycles
ld [hl], FOO
dec hl
But do:
; 3 bytes, 4 cycles
ld a, FOO
ld [hli], a
; 3 bytes, 4 cycles
ld a, FOO
ld [hld], a
Branching (control flow)
Relative jumps
Don't do:
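jp Somewhere ; 3 bytes, 4 cycles
But do:
jr Somewhere ; 2 bytes, 3 cycles
This only applies if Somewhere is close enough. The offset is a signed byte counted from the end of the jr instruction, so the target must lie between 128 bytes before and 127 bytes after that point.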
Compare a to 0
Don't do:
cp 0 ; 2 bytes, 2 cycles
And don't do:
or 0 ; 2 bytes, 2 cycles
And don't do:
and $ff ; 2 bytes, 2 cycles
But do:
or a ; 1 byte, 1 cycle
Or do:
and a ; 1 byte, 1 cycle
Compare a to 1
cp 1 ; 2 bytes, 2 cycles
If you don't care about carry or the value in a:
dec a ; 1 byte, 1 cycle, decrements a
Note that you can still do inc a afterwards, which is one cycle faster if the jump is taken. Compare:
cp 1
jr z, .equals1
with:
dec a
jr z, .equals1
inc a
Compare a to 255
(255, or $FF in hexadecimal, is the same as −1 due to two's complement.)
cp $ff ; 2 bytes, 2 cycles
If you don't care about carry or the value in a:
inc a ; 1 byte, 1 cycle, increments a
Note that you can still do dec a afterwards, which is one cycle faster if the jump is taken. Compare:
cp $ff
jr z, .equals255
with:
inc a
jr z, .equals255
dec a
Compare a to 0 after masking it
Don't do:
; 3 bytes, 3 cycles; sets zero flag if a == 0
and MASK
and a
But do:
and MASK ; 2 bytes, 2 cycles; sets zero flag if a == 0
Test whether a is negative (compare a to $80)
If you don't need to preserve the value in a, then don't do:
; 4 bytes, 4/5 cycles
cp $80
jr nc, .negative
And don't do:
; 4 bytes, 4/5 cycles
bit 7, a
jr nz, .negative
Instead, do:
; 3 bytes, 3/4 cycles
rlca
jr c, .negative
Subroutines (functions)
Tail call optimization
Don't do:
; 4 bytes, 10 cycles
call Function
ret
But do:
jp Function ; 3 bytes, 4 cycles
Call hl
Don't do:
; 5 bytes, 8 cycles
(some code)
ld de, .return
push de
jp hl
.return:
(some more code)
But do:
; 3 bytes, 6 cycles
; (4 bytes, 7 cycles, counting the definition of _hl_)
(some code)
call _hl_
(some more code)
_hl_ is a routine already defined in home/call_regs.asm:
_hl_::
jp hl
Inlining
Don't do:
; 4 additional bytes, 10 additional cycles
(some code)
call Function
(some more code)
Function:
(function code)
ret
if Function is only called a handful of times. Instead, do:
(some code)
; Function
(function code)
(some more code)
You shouldn't do this if Function used any returns besides the one at the very end, or if inlining its code would make some jrs too distant from their targets.
Fallthrough
Don't do:
(some code)
call Function
ret
Function:
(function code)
ret
And don't do:
(some code)
jp Function
Function:
(function code)
ret
But do:
(some code)
; fallthrough
Function:
(function code)
ret
Fallthrough is what you get when you combine inlining with tail calls. You can still call Function elsewhere, but one tail call can be optimized into a fallthrough.
Conditional fallthrough
(The example uses z, but nz, c, or nc would also work.)
Don't do:
(some code)
jr z, .foo
jr .bar
.foo
(foo code)
.bar
(bar code)
But do:
(some code)
jr nz, .bar
; fallthrough
.foo
(foo code)
.bar
(bar code)
Conditional return
(The example uses z, but nz, c, or nc would also work.)
Don't do:
; 3 bytes, 3 or 6 cycles
jr z, .skip
ret
.skip
...
But do:
; 1 byte, 5 or 2 cycles
ret nz
...
Conditional call
(The example uses z, but nz, c, or nc would also work.)
Don't do:
; 5 bytes, 3 or 8 cycles
jr nz, .skip
call Foo
.skip
But do:
; 3 bytes, 6 or 3 cycles
call z, Foo
And don't do:
; 5 bytes, 3 or 6 cycles
jr nz, .skip
jp Foo
.skip
But do:
; 3 bytes, 4 or 3 cycles
jp z, Foo
Conditional rst $38
(The example uses z, but nz, c, or nc would also work.)
Don't do:
; 5 bytes, 3 or 14 cycles
call z, RstVector38
...
RstVector38:
rst $38
ret
And don't do:
; 3 bytes, 3 or 6 cycles
jr nz, .no_rst_38
rst $38
.no_rst_38
...
And don't do:
; 3 bytes, 3 or 6 cycles
call z, $0038
...
But do:
; 2 bytes, 2 or 7 cycles
jr z, @ + 1 ; the byte for @ + 1 is $ff, which is the opcode for rst $38
...
(The label @ evaluates to the current pc value, which in jr z, @ + 1 is right before the jr instruction. The instruction consists of two bytes, the opcode and the relative offset. @ + 1 evaluates to in-between those two bytes. The jr instruction encodes its offset relative to the end of the instruction, i.e. the next pc value after the instruction has been read, so the relative offset is -1, aka $ff.)
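The offset arithmetic described above can be worked through with a toy encoder. This is a minimal Python sketch, not an assembler: `assemble_jr_z` is a hypothetical helper, the address $0150 is an arbitrary example, and the opcode bytes used are $28 for jr z and $ff for rst $38.

```python
JR_Z_OPCODE = 0x28    # opcode byte of jr z, e8
RST_38_OPCODE = 0xFF  # opcode byte of rst $38

def assemble_jr_z(pc, target):
    """Encode jr z, target for an instruction placed at address pc."""
    offset = target - (pc + 2)  # relative to the end of the 2-byte instruction
    if not -128 <= offset <= 127:
        raise ValueError("jr target out of range")
    return bytes([JR_Z_OPCODE, offset & 0xFF])

pc = 0x0150                       # arbitrary example address
code = assemble_jr_z(pc, pc + 1)  # jr z, @ + 1

# When the jump is taken, execution resumes at the end of the instruction
# plus the signed offset, which is the instruction's own second byte.
landing = pc + 2 + (-1)
landing_byte = code[landing - pc]
```

Here `code` comes out as the two bytes $28 $ff, and `landing_byte` is $ff, so a taken jump executes the offset byte itself as rst $38.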
Jump and lookup tables
Chain comparisons
Don't do:
cp 1
jr z, .equals1
cp 2
jr z, .equals2
cp 3
jr z, .equals3
...
But do:
dec a
jr z, .equals1
dec a
jr z, .equals2
dec a
jr z, .equals3
...
Or do:
dec a
ld hl, .jumptable
ld e, a
ld d, 0
add hl, de
add hl, de
ld a, [hli]
ld h, [hl]
ld l, a
jp hl
.jumptable:
dw .equals1
dw .equals2
dw .equals3
...
Or better, do:
dec a
ld hl, .jumptable
rst JumpTable
...
.jumptable:
dw .equals1
dw .equals2
dw .equals3
...
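The explicit jump table version boils down to reading a little-endian 16-bit word at .jumptable + 2 * (a - 1). A minimal Python sketch of that lookup follows; the helper names and the three target addresses are made up for illustration, and like the assembly it does no bounds checking on a.

```python
def build_table(targets):
    """Encode 16-bit addresses as little-endian words, like a run of dw entries."""
    data = bytearray()
    for address in targets:
        data += address.to_bytes(2, "little")
    return data

def jumptable_lookup(table, a):
    """Return the address that jp hl would jump to for a given value of a."""
    index = (a - 1) & 0xFF    # dec a
    offset = 2 * index        # add hl, de twice, starting from .jumptable
    low = table[offset]       # ld a, [hli]
    high = table[offset + 1]  # ld h, [hl]
    return (high << 8) | low  # ld l, a, then jp hl

# Hypothetical addresses standing in for .equals1, .equals2, .equals3.
table = build_table([0x4010, 0x4025, 0x4100])
targets = [jumptable_lookup(table, a) for a in (1, 2, 3)]
```

The table costs two bytes per case regardless of how far apart the targets are, which is why it tends to win over the dec a chain once there are more than a few cases.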