You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
POKEGB: A gameboy emulator that only plays Pokémon Blue
I recently tweeted out an emulator that I made over the last few days:
In the video, I show some source code that looks like 3 pokéballs, compile it with gcc, and run it to play Pokémon Blue. The game is controllable, but has no sound. It renders the graphics w/ 12 colors (shades of red and blue). I choose BLUE as the name for our hero, and JERK as the name of his rival. The video ends when Professor Oak stops me from walking into the tall grass.
A lot of folks asked me to do a write-up of how this all works, so let's dive in!
(TL;DR: this is a very long writeup, check out the unobfuscated code instead if you prefer!)
Stats
But first, let's talk a little about some statistics. The final version in that tweet is actually 68 lines of code (I tweeted the wrong number!), each line less than 150 characters long, for a total of 9956 bytes. If you ignore whitespace and comments, it comes to 4720 bytes. That's too large for the International Obfuscated C Code Contest, but pretty close.
Prior to formatting the source code to look like pokéballs, the source was 188 lines and a total of 7786 bytes, with 5954 bytes of non-whitespace source.
Overview
It'll be useful to know a little about how the gameboy works. This won't go into all the details (for that you can read the pandocs), but will be enough to understand how this code works. Many values are easier to represent in hexadecimal, so I'll write those with a leading $, e.g. $FE00.
CPU
The gameboy CPU is a bit weird -- it's kind of like an Intel 8080 and kind of like a Zilog Z80, but not the same as either. It has an 8-bit accumulator A, and 3 16-bit register pairs, BC, DE, and HL, which can be individually accessed as the 8-bit registers B, C, D, E, H and L. It also has a 16-bit stack pointer SP, a 16-bit program counter PC.
16-bit
High byte
Low byte
Description
AF
A
-
Accumulator and flags
BC
B
C
General purpose
DE
D
E
General purpose
HL
H
L
General purpose / memory access
PC
-
-
Program counter
SP
-
-
Stack pointer
The cpu also has 4 flag bits: Z, N, H, and C:
Flag
Name
Description
Z
Zero flag
Set when the result is zero
N
Subtraction flag
Set if the last instruction was subtract, used for BCD
H
Half-carry flag
Set if there is an overflow out of bit 3, used for BCD
C
Carry flag
Set if there is an overflow out of bit 7
When accessing the AF register (which can only be done via push/pop instructions), the flags are stored in the most-significant bits of F:
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
A
A
A
A
A
A
A
A
Z
N
H
C
0
0
0
0
The gameboy instruction set has ~500 instructions, each up to 3 bytes long. For example, the NOP instruction is just one byte (00), but the JP instruction is 3 bytes: $C3 followed by a 16-bit address $50 $01.
The instructions can be grouped as follows (taken from peach.bot's excellent document on the Emulation Development discord). I've included all instructions, but have struck through the ones that are not needed for Pokémon Blue. Feel free to skip over this on first read, I'll go into more detail later:
7
6
5
4
3
2
1
0
Mnemonic
0
0
0
0
0
0
0
0
NOP
0
0
0
0
1
0
0
0
LD (u16), SP
0
0
0
1
0
0
0
0
STOP
0
0
0
1
1
0
0
0
JR
0
0
1
condition
0
0
0
JR condition
0
0
r16 (group 1)
0
0
0
1
LD r16, u16
0
0
r16 (group 2)
0
0
1
0
LD (r16), A
0
0
r16 (group 1)
0
0
1
1
INC r16
0
0
r16 (group 1)
1
0
0
1
ADD HL, r16
0
0
r16 (group 2)
1
0
1
0
LD A, (r16)
0
0
r16 (group 1)
1
0
1
1
DEC r16
0
0
r8
1
0
0
INC r8
0
0
r8
1
0
1
DEC r8
0
0
r8
1
1
0
LD r8, u8
0
0
opcode (group 1)
1
1
1
Opcode Group 1
0
1
1
1
0
1
1
0
HALT
0
1
r8 (destination)
r8 (source)
LD r8, r8
1
0
opcode (group 2)
r8 (operand 2)
ALU A, r8
1
1
0
condition
0
0
0
RET condition
1
1
0
condition
0
1
0
JP condition
1
1
0
condition
1
0
0
CALL condition
1
1
0
0
1
1
0
1
CALL u16
1
1
1
0
0
0
0
0
LD ($FF00 + u8), A
1
1
1
0
0
0
1
0
LD ($FF00 + C), A
1
1
1
0
1
0
0
0
ADD SP, i8
1
1
1
0
1
0
1
0
LD (u16), A
1
1
1
1
0
0
0
0
LD A, ($FF00 + u8)
1
1
1
1
0
0
1
0
LD A, ($FF00 + C)
1
1
1
1
1
0
0
0
LD HL, SP + i8
1
1
1
1
1
0
1
0
LD A, (u16)
1
1
r16 (group 3)
0
0
0
1
POP r16
1
1
opcode (group 4)
0
1
1
Opcode group 4
1
1
r16 (group 3)
0
1
0
1
PUSH r16
1
1
opcode (group 2)
1
1
0
ALU A, u8
1
1
EXP
1
1
1
RST (call to 00EXP000)
You can also use the CB prefix to access a different set of instructions:
7
6
5
4
3
2
1
0
Mnemonic
0
0
opcode (group 3)
r8
Shift/Rotate
0
1
bit
r8 (source)
BIT bit, r8
1
0
bit
r8 (source/dest)
RES bit, r8
1
1
bit
r8 (source/dest)
SET bit, r8
r8 is encoded as:
value
register
0
B
1
C
2
D
3
E
4
H
5
L
6
(HL)
7
A
r16 is encoded as:
value
group 1
group 2
group 3
0
BC
BC
BC
1
DE
DE
DE
2
HL
HL+
HL
3
SP
HL-
AF
condition is encoded as:
value
condition
Meaning
0
NZ
Zero flag is 0
1
Z
Zero flag is 1
2
NC
Carry flag is 0
3
C
Carry flag is 1
opcode is encoded as:
value
group 1
group 2
group 3
group 4
0
RLCA
ADD
RLC
JP
1
RRCA
ADC
RRC
CB prefix
2
RLA
SUB
RL
illegal
3
RRA
SBC
RR
illegal
4
DAA
AND
SLA
illegal
5
CPL
XOR
SRA
illegal
6
SCF
OR
SWAP
DI
7
CCF
CP
SRL
EI
As you can see, there are many instructions but we can group them into ~40 different categories, 5 of which aren't used by Pokémon Blue (as far as I know). (Turns out I was wrong, RRA is used; see the TryingToLearn function in the pokered source).
One additional complication is how long each of these instructions take to execute. This is measured in "T-states", which run at the frequency of the CPU, ~4.19MHz. All instructions operate on a multiple of 4 T-states, which is often called machine-cycles or "M-cycles".
It's important to keep track of how many M-cycles each instruction takes, since other hardware components (such as the graphics rendering and audio) take time too, and we want to make sure that the CPU and these components stay in sync. Fortunately, the M-cycle counts for most instructions is relatively simple — every memory access costs 1 M-cycle (including the cost of reading the instruction). Some instructions take additional M-cycles, but we can handle those cases explicitly.
Memory
The gameboy CPU has a 16-bit address space, divided up into regions with different purposes. Here's a simplified layout:
Low
High
Size (in bytes)
Purpose
$0000
$3FFF
16384
Cartridge ROM bank 0
$4000
$7FFF
16384
Cartridge ROM bank 1
$8000
$9FFF
8192
Video RAM
$A000
$BFFF
8192
External RAM (often battery-backed)
$C000
$DFFF
8192
Work RAM
$E000
$FDFF
7680
Mirror of C000-DEFF
$FE00
$FE9F
160
Object attribute memory (OAM)
$FEA0
$FEFF
96
Unused
$FF00
$FF7F
128
Memory-mapped I/O
$FF80
$FFFF
128
"High" RAM
Memory Bank Controllers
Gameboy cartridges often also use a Memory Bank Controller (MBC), which allows swapping out regions of ROM and RAM at runtime, as well as other features. Without this it would be impossible to access more than 32768 bytes of ROM.
Pokémon Blue uses MBC3, which has the ability to swap 64 different 16KiB banks of ROM (for a total of 1MiB), as well as 4 different of 8KiB banks of External Ram (for a total of 32KiB). The Pokémon Blue cartridge also has support for a real-time clock, but I didn't implement support for that.
The way you communicate with the MBC is by writing to various address regions in ROM (i.e. in the range $0000-$7FFF). See the following table for details. The features that are unimplemented in pokegb have been struck through:
Low
High
Purpose
$0000
$1FFF
External RAM and timer enable
$2000
$3FFF
ROM bank number
$4000
$5FFF
RAM bank number (or RTC register select)
$6000
$7FFF
Latch clock data
Reading and writing to external RAM can be enabled and disabled. I believe this was used because the CPU may write garbage data to memory when losing power (e.g. when the gameboy is switched off). We don't care about that, so we'll skip implementing it. Fortunately, as far as I can tell, the game doesn't ever read from or write to this region when it is disabled.
As mentioned before, the cartridge is 1MiB, but the CPU address space is only 64KiB. So to access the rest of the cartridge data we have to remap ROM. Only ROM bank 1 can be remapped; ROM bank 0 is always mapped to the cartridge's first 16KiB chunk.
The mapped memory region in ROM bank 1 can be changed by writing a value between 0 and 63 to any address in the region $2000-$3FFF. So for example, if we write the value 5 to address $2000, then the data mapped to addresses $4000-$7FFF will reference the sixth 16KiB ROM bank of the cartridge (which are the bytes in range $14000-$17FFF). The one exception is that if bank 0 is selected, then the cartridge's second ROM bank is selected instead of the first.
Similarly, the external RAM address space is only 8KiB, but the available RAM is 32KiB. Writing a value between 0 and 3 to $4000-$5FFF will select a RAM bank.
Memory-mapped I/O
The gameboy has 80 bytes of memory-mapped I/O, from address $FF00 to $FF7F. These addresses are sometimes called registers, not to be confused with the CPU registers. Not all of these addresses are used by the gameboy, and not all of the valid addresses are used by Pokémon Blue, and not all of the addresses used by Pokémon Blue are implemented in pokegb!
Here is a list of the registers used by the original gameboy, with the unimplemented ones struck through:
Low
High
Name
Description
$FF00
$FF00
JOYP
Joypad input
$FF01
$FF02
SB / SC
Serial communication
$FF04
$FF04
DIV
Clock divider
$FF05
$FF07
TIMA / TMA / TAC
Timer counter / modulo / control
$FF0F
$FF0F
IF
Interrupt request
$FF10
$FF3F
-
Sound
$FF40
$FF40
LCDC
LCD control
$FF41
$FF41
STAT
LCD status
$FF42
$FF43
SCY / SCX
Background scroll registers
$FF44
$FF44
LY
Current Y line being drawn
$FF45
$FF45
LYC
Y line compare
$FF46
$FF46
DMA
DMA transfer to OAM
$FF47
$FF47
BGP
Background palette
$FF48
$FF49
OBP0 / OBP1
Object palettes
$FF4A
$FF4B
WY / WX
Window location
$FFFF
$FFFF
IE
Interrupt enable
Unfortunately, sound is completely skipped in pokegb. It would be fun to have, but would require a lot more code, I think.
Several of the registers are used as full 8-bit values (DIV, SCY, SCX, LY, WY, WX). Others give meaning to individual bits:
Register
7
6
5
4
3
2
1
0
JOYP
1
1
1
1
1
1
1
1
JOYP (reading directions)
1
1
1
0
~Down
~Up
~Left
~Right
JOYP (reading buttons)
1
1
0
1
~Start
~Select
~B
~A
LCDC
LCD enable
Window tile map
Window enable
BG/Window tile data
BG tile map
Object size
Object enable
BG/Window enable
IF
1
1
1
Joypad
Serial
Timer
LCD STAT
Vblank
IE
1
1
1
Joypad
Serial
Timer
LCD STAT
Vblank
BGP
color 3
color 2
color 1
color 0
OBP0
color 3
color 2
color 1
color 0
OBP1
color 3
color 2
color 1
color 0
As can be seen from the table, reading from the joypad gives different values depending on whether bits 4 or 5 of JOYP are 0. Typically the program will write to JOYP to read the directions and buttons separately, then later combine them into one byte. Also, the buttons are labeled with a leading ~ because they are inverted; e.g. the up direction is pressed when bit 2 is 0.
LCDC has a few bits to enable/disable the various rendering layers: background, window and objects (i.e. sprites). It also has some bits to choose which region of video RAM is used for map and tile data:
Bit
0
1
Window tile map
$9800-$9BFF
$9C00-$9FFF
BG/Window tile data
$8800-$97FF
$8000-$8FFF
BG tile map
$9800-$9BFF
$9C00-$9FFF
The IF and IE request and enable various interrupt sources, respectively. If a given interrupt is requested and enabled, and the Interrupt Master Enable flag (IME) is enabled, then an interrupt is fired. This will cause the CPU to stop normal execution and jump to an address for the "interrupt handler", the code that will respond to the interrupt, then return control back to the main program. Of all of the possible sources, pokegb only supports the vblank interrupt. The timer and serial interrupts are also used by Pokémon Blue, but are not implemented.
Finally, the BGP, OBP0, and OBP1 registers pick a palette color for the background and sprites. When rendering graphics, the gameboy can pick from 4 color values, each of which is an index into the appropriate palette. The palette then chooses what those color values mean:
Value
Color
0
White
1
Light gray
2
Dark gray
3
Black
For pokegb, I decided to use a more fun palette with different colors for BGP, OBP0, and OBP1.
Graphics
The gameboy has a 160x144 pixel LCD screen. The graphics hardware (also known as a Picture Processing Unit or PPU) has the ability to draw one 32x32 scrollable background layer, one 32x32 "window" layer, and up to 40 8x8- or 8x16-pixel sprites. It has 8192 bytes of video RAM to stores the tile and map data. Each tile stores the pixel colors for an 8x8 block. The gameboy only has 4 colors, so each pixel's color can be represented with only 2 bits, or 4 pixels per byte. As a result, each tile is 16 bytes. The map is an array of 32x32 indexes into the tile data, one byte per tile. So the 32x32 background and window maps each use 1024 bytes.
It turns out that Pokémon Blue uses the background, window and sprite layers, but only uses 8x8-pixel sprites.
As shown above, the address of the BG and window maps can be chosen to be either $9800-$9BFF or $9C00-$9FFF. Similarly the BG/Window tile data region can be chosen to be either $8000-$8FFF or $8800-$97FF. Interestingly, if $8800-$97FF is selected the tile indexes start at $9000 and wrap around. So, for example, tile index 127 is at address $97F0 and tile index 255 is at address $8FF0.
The way the gameboy actually renders the screen is fairly complex (see Pixel FIFO in pandocs), but we can cheat. Most games (including Pokémon Blue) don't change any graphics settings while a scanline is drawing. So we can draw each scanline as it finishes.
Each scanline takes 456 T-states (or 114 M-Cycles) to render. The current T-state for this scanline is sometimes called a "dot", which has a value in the range 0-455. There are a total of 144 scanlines, and 10 lines of "vertical blank" for a total of 154 lines per screen. So the amount of time it takes to render a full screen is 70224 T-states (or 17556 M-cycles).
The vertical blank (or vblank) scanlines are a holdover from CRT screens, which needed this period to reset the electron gun to the upper-left corner of the screen. This happens to be a useful feature for the gameboy too, even though it uses an LCD screen instead, since it is a period when the PPU is not accessing video RAM or other video hardware. This gives the software an opportunity to update video RAM without affecting the on-screen visuals.
Window
The window layer is a region displayed over the background layer. It can be enabled or disabled, and positioned with the WX and WY registers. When it is enabled, it always starts at the location (WX, WY - 7) and covers the background in a rectangular region toward the lower-right of the screen. It is not scrolled by the SCX and SCY registers.
The window is used in Pokémon Blue to display the title credits, various animations, menus, etc.
Sprites
The gameboy hardware supports up to 40 sprites per frame, and up to 10 sprites per scanline. Each sprite is defined in the OAM and uses 4 bytes. The bytes are laid out as follows:
Byte
7
6
5
4
3
2
1
0
0
Y-coordinate - 16
1
X-coordinate - 8
2
Tile index
3
Priority
Y-flip
X-flip
Palette
0
0
0
0
The x- and y-coordinates of the sprite is offset by 8 and 16 pixels, respectively. This is done so a sprite can be displayed off the left or top edge of the screen. The sprite will be displayed if the x-coordinate is in the range 0-167 and the y-coordinate is in the range 0-161, inclusive.
Unlike the BG/Window, the sprite tile data is always at $8000-$8FFF, but otherwise the tile data is displayed the same way.
The priority bit specifies whether this sprite is displayed in front of (0) or behind (1) the background/window. You can see this effect in Pokémon Blue when your player walks in the tall grass. It works because the Background color 0 is considered to be transparent, so the sprite will display through.
The x-flip and y-flip bits will flip the displayed sprite horizontally, vertically, or both, but otherwise will not affect how the sprite is displayed.
The palette bit specifies whether to use OBP0 or OBP1 when displaying the sprite.
Tiles
The background layer, window layer and sprite layer are all made up of 8x8 tiles. Each tile is 16 bytes, using 2 bits-per-pixel. For each pixel, there is a low bit and a high bit. The low bit and high bits are stored in separate bytes, where each byte stores one row:
Offset
Row
Bit
0
Row 0
Low bit
1
Row 0
High bit
2
Row 1
Low bit
3
Row 1
High bit
4
Row 2
Low bit
5
Row 2
High bit
6
Row 3
Low bit
7
Row 3
High bit
8
Row 4
Low bit
9
Row 4
High bit
10
Row 5
Low bit
11
Row 5
High bit
12
Row 6
Low bit
13
Row 6
High bit
14
Row 7
Low bit
15
Row 7
High bit
So for example, if we wanted to draw the pokéball that Ash is tossing on the title screen:
Row
7
6
5
4
3
2
1
0
1
0
0
0
3
3
3
0
0
2
0
0
3
2
1
0
3
0
3
0
3
2
2
1
1
2
3
4
0
3
2
2
2
2
2
3
5
0
2
1
2
2
2
0
3
6
0
0
3
0
0
0
3
0
7
0
0
0
2
3
3
0
0
8
0
0
0
0
0
0
0
0
Then we would separate the tile into two bit planes, where the low bit plane only has the values 0 or 1, and the high bit plane only has the values 0 or 2. If we add them together for each pixel, we'll get the image above:
Row
Low byte
7
6
5
4
3
2
1
0
High byte
7
6
5
4
3
2
1
0
1
$1C
0
0
0
1
1
1
0
0
$1C
0
0
0
2
2
2
0
0
2
$2A
0
0
1
0
1
0
1
0
$32
0
0
2
2
0
0
2
0
3
$4D
0
1
0
0
1
1
0
1
$73
0
2
2
2
0
0
2
2
4
$41
0
1
0
0
0
0
0
1
$7F
0
2
2
2
2
2
2
2
5
$21
0
0
1
0
0
0
0
1
$5D
0
2
0
2
2
2
0
2
6
$22
0
0
1
0
0
0
1
0
$22
0
0
2
0
0
0
2
0
7
$0C
0
0
0
0
1
1
0
0
$1C
0
0
0
2
2
2
0
0
8
$00
0
0
0
0
0
0
0
0
$00
0
0
0
0
0
0
0
0
We then interleave these bit planes as shown above to get the 16 bytes for this tile:
With that out of the way, I'd like to talk a little about how I made pokegb. I started working on Monday, May 24th. I've already written a gameboy emulator (binjgb), so I knew some of what I'd have to do. But I didn't know a lot about Pokémon Blue's code. The gameboy actually has a lot of features, and I didn't know how many Pokémon Blue used. I was hoping that I could avoid implementing a lot of them to keep the code small.
I decided to start by loading the game in binjgb and tracing the instructions. The trace output looked like this:
This prints out the CPU state after each instruction, along with a disassembly.
When I started pokegb, I spent most of the first few days implementing the CPU. Every time I hit an instruction that wasn't implemented, I'd implement it and see if it would run any further. I made pokegb write a trace output similar to binjgb, so I could see if they were diverging. Pokegb was significantly less accurate so unfortunately I couldn't just diff the two trace outputs. Instead I had to spot check them to see if the output looked way off.
After a few days, the CPU seemed to be in good shape. The game ran for millions of instructions without failing, and matched binjgb pretty closely. At that point I started working on rendering the graphics, and added some input. To my surprise, everything seemed to come together pretty quickly at that point! I only had a few more instructions to implement, and I was able to walk around, have conversations, and have battles. I'll admit that I haven't played very far into the game, so there are very likely bugs!
Code
Let's dive into the obfuscated code. I've made an unobfuscated version, where I've given the variables better names and formatted it using clang-format, but have otherwise not changed it.
Let's start at the top.
Opcode Macros
As mentioned above, most instructions can be grouped into one of ~40 categories. To make it easier to group them, I made 10 different macros: OP4_NX8, OP4_NX16_REL, OP5_FLAG, OP8_NX8_REL, OP7_PTR, OP7_NX8, OP7_NX8_PTR, OP49_REL, OP56_PTR_REL, OP9_IMM_PTR.
The naming convention I've used is as follows:
OPn: This macro defines n instructions.
NXm: Each instruction is spaced out by multiples of m.
PTR: This macro also sets the ptr8 variable, which points to the address of the 8-bit source register.
REL: This macro also sets the opcode_rel variable, which is the offset of the given opcode in its group (e.g. OP4_NX16_REL will have opcode_rel values of 0, 16, 32, or 48). This can be helpful for indexing into a register array.
IMM: This macro also defines the "immediate" instruction e.g, ADD A, u8.
FLAG: This macro defines an instruction that uses the condition codes as well as its unconditional version, e.g. CALL NZ, u16 and CALL u16.
Let's take a look at some of the more interesting macros:
First up is OP5_FLAG. This macro defines 4 conditional instructions, as well as the unconditional version. Here the variable carry is re-purposed to determine whether the operation should be performed:
The F_mask and F_equals arrays are defined to encode the condition table above, and F is the flag register:
For example, the condition NZ means to execute the instruction if the Z flag is not set. The Z flag is bit 7 of the flags register, or 128 in decimal. So we can tell if Z is not set by calculating (F & 128) == 0.
int F_mask[]={128,128,16,16}, F_equals[]={0,128,0,16};
Next is OP9_IMM_PTR. This macro defines 9 instructions, one for each entry in the r8 table above, as well as for the immediate instruction (which is always at an offset of 70). Each of these instructions operate on an 8-bit operand: one of the 7 registers (A, B, C, D, E, H, or L), the byte at address HL, or the immediate value u8.
The read8_pc() function reads the next byte at PC (i.e. the next byte in the instruction stream). The read8() function can read a byte at any address, but defaults to reading the byte at HL. And as mentioned above, ptr8 points to one of the 8-bit registers.
It's convenient to be able to access most registers both as their 16-bit pairs or their 8-bit parts. So all CPU registers are defined in an array, along with their initial values.
These are defined in the order C, B, E, D, L, H, SP High, SP Low. This allows us to read a 16-bit register pair directly as long as the host machine is little-endian. Note that SP is never accessed as an 8-bit register, so it doesn't need to be defined here; this may be a place where I could save some additional bytes.
The A and F registers are used for many of the instructions directly, so we create references to them by name.
The r8 group described above uses the order B, C, D, E, F, H, L, (HL), A. The reg8_group array matches this, but uses &F instead. This is never actually used, so it could have been 0 instead to save a byte.
There are three r16 groups above, and each are represented here, with reg16 being r16 (group 3). Group 2 is actually meant to be BC, DE, HL+, HL-, where HL+ and HL- use the current value of HL and then increment or decrement, respectively. That behavior is not included here, but instead used directly in the instructions' implementations via the HL_add array.
As with A and F, HL and SP are often used directly, so we've created references to them by name.
Finally, there is a helper function for setting CPU flags. The mask parameter masks out any flags that are supposed to be set by this instruction, then the Z, N, H, C parameters will set it to their new value. As a size optimization, the Z flag is negated here. That's because many instructions set the Z flag if the instruction produces a 0 result, so it's convenient to do the logical not here rather than all places that call set_flags().
voidset_flags(uint8_t mask,int Z,int N,int H,int C){ F =(F & mask)|(!Z *128+ N *64+ H *32+ C *16); }
Memory
All the memory for the system is stored in static arrays or memory-mapped regions.
uint8_t*rom0,*rom1,*extram,*extrambank;
rom0: a pointer to the first 16KiB bank of ROM (addresses $0000-$3FFF). This is memory-mapped at the beginning of the program.
rom1: a pointer to the currently mapped bank of ROM1 (addresses $4000-$7FFF).
extram: a pointer to the entire 32KiB of External RAM. This is memory-mapped at the beginning of the program.
extrambank: A pointer to the currently mapped 8KiB of External RAM.
The frame_buffer is a 160 * 144 = 23040 buffer, with an int for each pixel. I assume that an int is 4-bytes, so it can store a 32-bit RGBA value.
int frame_buffer[23040]; int palette[]={-1,-23197,-65536,-16777216,-1,-8092417, -12961132,-16777216,-1,-23197,-65536,-16777216};
The palette values look strange, but they are stored as signed-integers to save in source code size. When written as hexadecimal values, they look more natural. (Note that these values are stored so the red channel is the least-significant byte and the alpha channel is the most-significant byte):
Palette
Color 0
Color 1
Color 2
Color 3
BGP
0xffffffff
0xffffa563
0xffff0000
0xff000000
OBP0
0xffffffff
0xff8484ff
0xff3a3a94
0xff000000
OBP1
0xffffffff
0xffffa563
0xffff0000
0xff000000
Memory Access
The CPU can only read or write one byte at a time. The mem_access() function implements both reading and writing to save on source code size. If write == 0 then val is ignored. If write != 0, then the return value is ignored.
First, note that all memory accesses call tick(). This increments the cycle counter by 4, which is the number of T-states required to do an 8-bit memory access.
As a size-optimization, both read8() and write8() take HL as the default address. This requires write8() to be written "backward", with val as the first parameter and addr as the second.
In mem_access(), the address is shifted right by 12. This extracts the top nibble of the address, which can then be used to determine which region of memory was accessed.
For cases 2 and 3, the address being accessed is in the range $2000-$3FFF. If this region is written to, it will remap ROM bank 1 (see the Memory Bank Controllers section above). The bank number is shifted left by 14, since each bank is 1 << 14 = 16KiB.
Either way, we fall through to cases 0 and 1 below. For 0 and 1, the address being accessed is in the range $0000-$1FFF, which is in ROM bank 0.
case2:case3: if(write) rom1 = rom0 +((val ? val &63:1)<<14);
case0:case1: return rom0[addr];
For cases 4 and 5, the address being accessed is in the range $4000-5FFF. If the region is written to, it will remap the External RAM bank. Either way, we fall through to cases 6 and 7 below, which covers the ROM bank 1 region.
case4:case5: if(write && val <=3) extrambank = extram +(val <<13);
case6:case7: return rom1[addr &16383];
Cases 8 and 9 are the video RAM region, addresses $8000-$9FFF. On the real gameboy the video RAM cannot be updated during certain times (e.g. when the PPU is accessing it), but we assume the game is well-behaved and allow it at all times.
Cases 10 and 11 are in the external RAM region, addresses $A000-$BFFF. On real gameboy there are ways to disable read and write access to this region, but again we assume that the game is well-behaved and allow it at all times.
Case 15 is the most complex. Here we first test if the address is greater than or equal to 65024 ($FE00). If so, then we know that we're accessing somewhere in the io region, which includes OAM, I/O, and High RAM.
If the address is a write to 65350 ($FF46), then the game is requesting a DMA transfer to OAM. This is an extremely common way for games to copy sprite data into OAM, since it is much faster than having the CPU do it manually. The value written to $FF46 is used as the high byte of the source address. So, for example, if the value is $C1, then 160 bytes will be copied from $C100-$C19F to OAM.
If the address read from is 65280 ($FF00), then the game is reading the JOYP register which gives the current joypad state. The value returned depends on what was previously written to JOYP, either the directions buttons if bit 4 of JOYP is 0, or the face buttons if bit 5 of JOYP is 0. If neither is 0, then JOYP returns 255 ($FF).
Finally, if the memory access address is less than 65024, then it must be an access to work RAM, either the primary address range at $C000-$DFFF, or the mirror range at $E000-$FDFF. In either case the address is masked and a write or read is performed.
A few more functions are used to reduce code size. First read8_pc() (mentioned above), which reads a byte from the instruction stream.
uint8_tread8_pc(){returnread8(PC++);}
Next read16(), which reads a 16-bit value from any address, but defaults to PC. The gameboy CPU is little-endian, so this writes the least-significant byte first. This function is also used to implement the POP instruction. Note that addr must be incremented in separate statements to avoid undefined behavior.
Then push(), which pushes a 16-bit value onto the stack. The most-significant byte is written first, because the value is written as SP is decremented. Also note that the tick() is called because PUSH always takes an additional M-cycle.
The main() function begins by initializing the ROM and External RAM areas:
Since we know that Pokémon Blue is 1MiB, we can directly mmap it. This saves some space over fopen/fread as long as we use numbers instead of PROT_READ, etc.
Similarly, we can mmap the External RAM file, though this is even more convenient since any writes to this region will automatically be written to the save file.
One slightly annoying quirk is that the file must be truncated to the correct length, or else writes will crash the emulator. This behavior seems to work on Linux, but fails on macOS (see pokegb issue 1):
The LCDC and DIV I/O registers are set to initial values to match binjgb. It's likely that this isn't actually required, and could be removed to save some bytes.
LCDC =145; DIV =44032;
Next we initialize SDL:
SDL_Init(SDL_INIT_VIDEO);
Then create a window at 5x the gameboy resolution, along with a renderer that can draw to it:
And finally grab the keyboard state array. Conveniently, this will be automatically updated by SDL whenever we poll for events:
key_state =SDL_GetKeyboardState(0);
CPU
After initialization, we start a loop that steps the CPU and updates the graphics. There are a few different states that the CPU can be in: running normally, "halted" waiting for an interrupt, and handling an interrupt.
That is all handled at the start of the loop:
while(1){ prev_cycles = cycles; if(IME & IF & io[511]){ IF = halt = IME =0; cycles +=8; carry =1; temp16 =64; goto CALL; }elseif(halt) tick(); else switch(opcode =read8_pc()){ ... } ... }
First, the the current cycle counter is saved so we know how many cycles to update the PPU later. Then we check if any bit is set in all of Interrupt Master Enable flag (IME), Interrupt Request (IF), and Interrupt Enable (io[511]). Since pokegb only supports the vblank interrupt, then we know that we only need to check for bit 0, but it's smaller to check for any bit here.
If an interrupt occurs, the CPU is no longer halting, and the IME and IF flags are cleared. The CPU then performs the equivalent of a CALL to address $0040 (64) which is the vblank interrupt handler.
If the CPU is halted, then we simply tick the clock and loop until an interrupt occurs.
Otherwise, we read the next byte in the instruction stream and try to execute it.
Instructions
This is the real meat of the emulator. As we saw above, there are a lot of instructions, and most of them are needed to run Pokémon Blue. Let's start with the simplest instruction, NOP, which does nothing:
case0: break;
Off to a good start! Next is LD r16, u16, which copies a 16-bit value from the instruction stream into a 16-bit register pair. This handles opcodes 1, 17, 33, and 49. Since these opcodes are spaced by multiples of 16, we can shift right by 4 to index into r16 (group 1):
Next up, LD (r16), A, which writes A into memory at a given address. This also can optionally increment or decrement HL. Notice that we use opcode_rel for this; it's not required in this case since opcode / 16 is equal to opcode_rel / 16. But for other instructions it will be useful:
Likewise there are INC r8 and DEC r8 which increment or decrement an 8-bit register. Unlike their 16-bit equivalents, these update CPU flags. goto is used to share some code:
OP7_NX8_PTR(4) tmp8 =++(*ptr8); goto INC_FLAGS;
OP7_NX8_PTR(5) tmp8 =--(*ptr8); goto DEC_FLAGS;
The r8 group also includes (HL), which means to use the byte at the address HL instead. It is simpler to handle these separately, using read8() and write8().
These instructions also have the code to handle setting flags. Remember that the first parameter is the mask. 16 is the value of the carry flag, so this mask means that the carry flag will be preserved. The other flags are set according to the rest of the arguments: Z is set if the result is zero, N is set if the operation was a decrement, and H is set if there was an overflow to bit 4 of the result.
Next is LD r8, u8, which loads a value from the instruction stream into a register or (HL).
OP7_NX8_PTR(6) *ptr8 =read8_pc(); break;
case54: write8(read8_pc()); break;
Next, RLCA and RRCA, which rotate the A register left or right by one bit. The bit that is rotated out is copied to the carry flag, as well as moved to the other end.
A += A is used instead of A << 1, and A / 2 is used instead of A >> 1 to reduce code size.
case7: A += A +(carry = A >>7); goto CARRY_FLAG;
case15: A =(carry = A &1)*128+ A /2; goto CARRY_FLAG;
RLA is similar to RLCA, but it rotates the A register "through" the carry flag, meaning that it is a 9-bit rotate. Bit 7 of A is moved into the carry flag, and the carry flag is moved into bit 0 of A.
When the flags are set, the mask is 0, which means that all flags are set: Z is always 1, N and H are 0, and C is set to carry:
The expression F / 16 % 2 looks odd, but extracts the carry bit from F into bit 0.
case23: carry = A >>7; A += A + F /16%2; CARRY_FLAG: set_flags(0,1,0,0, carry); break;
Now onto the JR instructions. They branch the PC to a new location, relative to where the current PC is, either forward or backward. That's why tmp8 is cast to an int8_t. The carry variable is used as a condition, which is automatically set by the OP5_FLAG macro. Notice also that this uses the comma operator to save a byte of source code.
OP5_FLAG(32,24) tmp8 =read8_pc(); if(carry) PC +=(int8_t)tmp8,tick(); break;
Next is a beast of an instruction, Decimal Adjust for Addition (DAA). It's intended to be used for handling arithmetic on Binary Coded Decimal (BCD) values.
The basic idea is relatively simple: a two-digit decimal number can be stored in a byte where each nibble stores one digit. For example, the decimal number 42 can be stored as the hexadecimal number $42 (66). This is useful for older harware that didn't have a divide instruction, like the gameboy.
DAA is used after addition or subtraction to fix the BCD value. So if we take the decimal numbers 42 and add 9, we expect to get 51. But if we do this with BCD values, we'll get $4B instead. Executing DAA after this addition will adjust $4B to $51 as expected.
The first if determines whether we need to adjust the least-significant nibble, and the second if determins whether we need to adjust the most-significant nibble. The adjustment value added to (or subtracted from) A is either $00, $06, $60, or $66, depending on the values of the H, C, and N flags, and the current value of the A register.
For our example, $06 will be added to $4B to produce $51.
case39: carry = tmp8 =0; if(F &32||(!(F &64)&& A %15>9)) tmp8 =6; if(F &16||(!(F &64)&& A >153)) tmp8 |=96, carry =1; set_flags(65, A +=(F &64)?-tmp8 : tmp8,0,0, carry); break;
Now a simpler one, CPL. This instruction performs the ones' complement of A, which means flipping all of its bits.
case47: A =~A; set_flags(129,1,1,1,0); break;
SCF and CCF set or complement the carry flag. The implementation combines them to save on space.
The LD r8, (HL) and LD (HL), r8 instructions read from or write to memory. In this case there is an exception to the pattern; normally the r8 group includes (HL), but there is no LD (HL), (HL) instruction. That instruction is instead HALT.
OP7_NX8_PTR(70) *ptr8 =read8(); break;
OP7_PTR(112) write8(*ptr8); break;
Speaking of which, here's HALT. On the real gameboy, there are some complexities with how HALT behaves, but here we merely set the halt variable to 1.
case118: halt =1; break;
The LD r8, r8 group covers 49 different instructions, not including (HL). It has several seemingly redundant instructions, such as LD B, B and LD C, C. Some emulators use these instructions as a breakpoint instead, since they otherwise have no purpose:
Here's one of the trickier set of instructions. There are 4 different operations handled here: addition (ADD), addition with carry (ADC), subtraction (SUB), and subtraction with carry (SBC). In other words, each instruction does the following:
ADD: A = A + operand + 0
ADC: A = A + operand + carry
SUB: A = A - operand - 0
SBC: A = A - operand - carry
They all can be implemented using the same code. In regular mathematics, we can rewrite subtraction as addition (e.g. 3 - 5 = 3 + -5). We can do the same for two's complement arithmetic, but instead of negating the operand, we can complement it (~) and add one, which is equivalent:
ADD: A = A + operand + 0
ADC: A = A + operand + carry
SUB: A = A + ~operand + 1 - 0
SBC: A = A + ~operand + 1 - carry
But since carry is either 0 or 1, we can simplify this to:
ADD: A = A + operand + 0
ADC: A = A + operand + carry
SUB: A = A + ~operand + 1
SBC: A = A + ~operand + ~carry
The only additional complication is how to deal with the flags. The N flag is simplest, it is 0 for addition and 1 for subtraction. The H flag is set if there is an overflow into bit 4 of the result. It turns out that we can correctly calculate this by seeing if the sum of the low nibbles of A, operand, and carry is greater than 15. For subtraction, we take this result and flip it, i.e. we only set the H flag if the sum is not greater than 15. Similarly, the C flag checks for an overflow out of bit 7 of the result, so we check whether the sum A + operand + carry is greater than 255.
OP9_IMM_PTR(128) neg = carry =0; goto ALU;
OP9_IMM_PTR(136) neg =0; carry = F /16%2; goto ALU;
OP9_IMM_PTR(144) carry =1; goto SUBTRACT;
OP9_IMM_PTR(152) carry =!(F /16%2); SUBTRACT: neg =1; operand =~operand; ALU: set_flags(0, tmp8 = A + operand + carry, neg, (A %16+ operand %16+ carry >15)^ neg, (A + operand + carry >255)^ neg); A = tmp8; break;
Next are the common bitwise operations, AND, OR, and XOR. They are relatively straightforward since c++ has built-in operators for them:
OP9_IMM_PTR(160) set_flags(0, A &= operand,0,1,0); break;
OP9_IMM_PTR(168) set_flags(0, A ^= operand,0,0,0); break;
OP9_IMM_PTR(176) set_flags(0, A |= operand,0,0,0); break;
Next up is the collection of CP instructions. They have the same behavior as SUB, except they don't modify A, only the CPU flags:
OP9_IMM_PTR(184) set_flags(0, A != operand,1, A %16< operand %16, A < operand); break;
Now we get to the set of RET instructions. They return from a function call by popping the return address off of the stack. Like the JR instructions above, they come in "conditional" and "unconditional" flavors. The OP5_FLAG macro sets carry to the result of the condition. But there's also RETI, which is the same as RET followed by EI (enable interrupts).
case217: carry = IME =1; goto RET;
OP5_FLAG(192,201) RET: tick(); if(carry) PC =read16(SP); break;
Then onto PUSH and POP instructions. These write to or read from the stack, and then move the stack pointer accordingly. They don't have to be used for stack access, however. Pokémon Blue actually uses them to decode some graphics during vblank, since POP is a fast way to read 16 bits into a register pair.
The JP instructions is similar to the JR instructions, except that the jump is absolute instead of relative.
OP5_FLAG(194,195) temp16 =read16(); if(carry) PC = temp16,tick(); break;
The CALL instructions are similar to the JP instructions, but the also push the old PC to the stack first:
OP5_FLAG(196,205) temp16 =read16(); CALL: if(carry) push(PC), PC = temp16; break;
The gameboy also has some instructions to make it easier to access the top 256 bytes of the address space ($FF00-$FFFF). The byte offset is either read from the instruction stream as ($FF00 + u8) or from the C register as ($FF00 + C).
The LD A, (u16) instruction is also implemented here, which reads an address from memory, then reads that byte into the A register. Its implementation is combined with LD A, ($FF00 + u8) to save space.
case240:case250: A =read8(opcode ==240?65280|read8_pc():read16()); break;
The counterpart to LD A, (u16) is LD (u16), A, which reads an address from the instruction stream, then writes the value in the A register to that address. This probably could have been combined with case 224 and 226 above to save space.
case234: write8(A,read16()); break;
All of the other jump instructions we've seen so far take an immediate offset or address, but JP HL is used when you want to jump to an arbitrary address, for example, to implement a jump table.
case233: PC = HL; break;
The Interrupt Master Enable (IME) flag can be modified with the DI and EI instructions, either disabling or enabling the flag.
case243:case251: IME = opcode !=243; break;
Finally, there are two instructions that copy between HP and SP. The LD HL, SP + i8 adds a signed value to SP and stores it in the HL register. On the other hand, LD SP, HL just copies the HL register directly into SP.
The $CB instruction is an instruction prefix. All $CB instructions are two bytes, and the second byte determines which operation it is (see opcode group 3 above).
First off are RLC and RRC, which are similar to RLCA and RRCA above but can be used for any register (as well as HL) Interestingly, the RRC (HL) instruction is also used by Pokémon Blue, but RLC (HL) doesn't seem to be. It works the same as RRC r8, but operates on the memory at (HL):
Now Shift Left Arithmetic (SLA) and Shift Right Arithmetic (SRA). "Arithmetic" means that SRA behaves like a signed divide by 2. In two's complement, that means that the most-significant bit (i.e. the sign bit) is not changed. For example, performing SRA on $43 (67 in decimal, 01000011 in binary) results in $21 (33 in decimal, 00100001 in binary). On the other hand, performing SRA on $F0 (-16 in decimal, 11110000 in binary) results in $F8 (-8 in decimal, 11111000 in binary).
On the other hand, Shift Right Logical (SRL) fills 0 into the most-significant bit after a shift. So for example, performing SRL on $F0 (240 in decimal, 11110000 in binary) results in $78 (120 in decimal, 01111000 in binary).
The BIT instructions check whether a given bit is set, and set the Z flag accordingly. The bit to check is encoded in the opcode itself, so we can use opcode_rel / 8 to extract it:
With that, we've completed the CPU implementation, in 1955 bytes (after removing all unnecessary whitespace).
Graphics
Now that we've executed the CPU, we want the PPU to catch up. The current CPU cycle count is in cycles, and the cycle count before executing the instruction is in prev_cycles. So we can update the PPU by cycles - prev_cycles T-states.
The DIV I/O register counts the number of T-states as well, but only stores the high byte of this value at address $FF04. Fortunately $FF03 is left unused, so we can treat DIV as a 16-bit value starting at $FF03 instead, and increment it directly:
First off, we only want to draw the screen if the LCD is enabled (see LCDC bit 7). So we check for that, and otherwise reset the current PPU dot and the current scanline counter to zero:
if(LCDC &128){ ... }else LY = ppu_dot =0;
Next we want to trigger a vblank interrupt when the screen is finished rendering. This happens after the 143rd line finishes, so we check whether the current "dot" was 0, and whether the current scanline is 144, and set the interrupt request to 1. This could have been just IF=1, since we don't support any other interrupts:
if(++ppu_dot ==1&& LY ==144) IF |=1;
After that is the fun bit, actually drawing a scanline! Since we're keeping things simple, we only render a scanline when it finishes, at dot 456. It is possible on the real gameboy to modify PPU settings during the scanline, but those don't happen in Pokémon Blue.
After rendering the line, we increment LY and reset ppu_dot back to 0.
if(ppu_dot ==456){ ...
LY =(LY +1)%154; ppu_dot =0; }
If the current scanline is less than 144, we draw it. This includes rendering the background and window layers, and any sprites that are on this line. We loop over every x-coordinate as tmp:
if(LY <144) for(tmp =160;--tmp >=0;){ ... }
First we want to determine if we're drawing the window layer or the background layer. Like everything else, the real logic to determine this is somewhat complex, but we can do a simpler version. If the window is enabled, and the current pixel we're drawing (tmp, LY) is greater than the upper-left coordinate of the window (WX - 7, WY), then we are drawing the window. Recall from above that WY is io[330] and WX is io[331]:
Now we want to find out the pixel offset of the current pixel, either into the background map or the window map. If this pixel is displaying the background, then we offset by the background scroll registers (SCX, SCY). If instead this pixel is displaying the window, then we offset by the window's upper left corner (WX - 7, WY). Again, note that SCX is io[323] and SCY is io[322]:
Now that we know the pixel offset, we can determine which tile is being drawn. The background and window maps start at either address $9800 or $9C00, depending on the settings in LCDC. Since we're reading from video RAM directly (rather than through the read8() function), we want to use the offset into video RAM instead, $1800 or $1C00. This can be rewritten as $6 << 10 and $7 << 10.
The window map is at $1C00 if LCDC bit 6 is set (== 64), and the background map is at $1C00 if LCDC bit 3 is set (== 8). So we can determine the base map address by checking the correct bit, depending on is_window.
Finally, the tile offset within the map can be found by taking the pixel offset and dividing by 8, since there are 8 pixels per tile in both dimensions. The map is 32x32 tiles, and each tile is one byte, so multiplying the y tile offset by 32 gives the final expression.
We also set the palette_index to 0 by default (which means to use BGP).
Now that we know which tile is being rendered, we have to read the pixel data for that tile. The starting address of the tile data is either $8000 or $8800. But as mentioned above, if the starting address is $8800, then tile 0 is actually at address $9000, and then the tile indexes wrap around back to $8800 at tile index 128. LCDC bit 4 (== 16) determines this behavior.
We can get the pixel row of the tile to draw with the expression y_offset % 8. Each row is 2` bytes (since it has two bitplanes) and each tile is 16 bytes:
This gives us the low byte of this row in tile_data[0] and the high byte in tile_data[1]. x-offset & 7 will give us the offset of the current pixel in this tile's row. However, you may have noticed above that a tile's row has the leftmost pixel as the most-significant bit. So, for example, if we are at pixel offset 0, we want to use bit 7 of the row. This means that the amount to shift right is actually 7 - x_offset, which is the same as x_offset ^ 7.
x_offset =(x_offset ^7)&7;
uint8_t color = (tile_data[1]>> x_offset)%2*2+ (*tile_data >> x_offset)%2;
Let's say we're drawing the fifth pixel of the second row of Ash's pokéball:
Byte
7
6
5
4
3
2
1
0
Low
0
0
1
0
1
0
1
0
High
0
0
1
1
0
0
1
0
Then x_offset is 3, so when we shift right by 3, we get:
Byte
7
6
5
4
3
2
1
0
Low
0
0
0
0
0
1
0
1
High
0
0
0
0
0
1
1
0
Then we perform % 2:
Byte
7
6
5
4
3
2
1
0
Low
0
0
0
0
0
0
0
1
High
0
0
0
0
0
0
0
0
And then we shift the high byte left by 1, and add the bytes together:
Byte
7
6
5
4
3
2
1
0
Both
0
0
0
0
0
0
1
0
This gives us a final color of 2.
After we've calculated the background or window pixel color, we may have to draw a sprite on top of it. Sprites can be enabled or disabled with bit 1 of LCDC:
if(LCDC &2) ...
OAM has enough space for 40 sprites, each one requiring 4 bytes. On a real gameboy only 10 sprites can be drawn per scanline, but we'll ignore that limit. OAM starts at the beginning of io (address $FE00-$FE9F).
The sprite's coordinate is (sprite[1] - 8, sprite[0] - 16). Similar to the background tile, we can find which row and column of the sprite tile by subtracting the current pixel's coordinate from the sprite's. The sprite's tile is drawn if the x- and y-offset are both less than 8:
As with the background tile, we want to xor the x-offset with 7 so the most-significant bit is also the leftmost bit. However, a sprite can also be flipped horizontally if bit 5 of sprite[3] is set. In that case we can leave the x-offset as it is by xor'ing it with 0:
sprite_x ^= sprite[3]&32?0:7;
Again, like background tiles, we want to fetch the tile data row for this sprite. The sprite's tile is in sprite[2]. The sprite can also be flipped vertically, so we xor the sprite's y-offset with 7 if bit 6 of sprite[3] is set.
A sprite's pixel is only drawn if the sprite's pixel is non-transparent. But the background pixel also must be transparent or the sprite must have priority over the background. In other words, if the background pixel is non-transparent and the sprite doesn't have priority, then the sprite's pixel will not be drawn.
The sprite priority is bit 7 of sprite[3]; if it is 1, then the sprite is drawn behind the background.
The sprite can use one of two palettes, OBP0 and OBP1. If bit 3 of sprite[3] is 1, then palette OBP1 is used. We want palette_index to be 1 for OBP0 and 2 for OBP1, so we add one to the value. !! is a little trick to convert a non-zero value to 1:
if(!((sprite[3]&128)&& color)&& sprite_color){ color = sprite_color; palette_index +=1+!!(sprite[3]&8); break; }
Now that we know the color and palette_index, we can look it up in the palette to find the actual RGBA color to draw. Each palette is a single byte, stored like this:
Name
Address
io offset
7
6
5
4
3
2
1
0
BGP
$FF47
327
color 3
color 2
color 1
color 0
OBP0
$FF48
328
color 3
color 2
color 1
color 0
OBP1
$FF49
329
color 3
color 2
color 1
color 0
Adding 327 to palette_index gives us the offset in io for the palette. We can then shift the palette byte to the right by 2 * color to isolate the palette entry we want.
We'll write the final RGBA color into the frame buffer at the current pixel's x- and y-coordinate, (tmp, LY).
Once we've finished drawing all the lines, we can finally draw the framebuffer to the window! First we have to lock the texture that we created before, so we have access to its pixels. The tmp parameter stores the pitch of the texture, i.e. the width of a row of the texture in bytes, which may include extra padding.
For each row of the frame buffer, we copy it into the texture. The total number of bytes copied per row is 160 * 4 == 640, since an RGBA pixel is 4 bytes:
Before I sent the tweet, I wanted to make the code look cool. I remember looking at some examples from the IOCCC and thought I'd take a chance at formatting the code into a picture. See, for example, Don Yang's winning entry from the 2020 competition:
To start, I took the source code I had, and renamed all the variables to single letter names, where possible. I then removed all whitespace to see how many characters I was using, which was 4150 bytes, not including the #include and #define lines.
My first thought was to use the Pokemon logo itself. I tried to make a good 1 bit-per-pixel version, but it ended up looking like garbage when I tried to fit the source code to it:
My next thought was to use a single pokéball. I used the awesome Desmos online calculator to try to figure out how big each circle should be interactively:
This looks pretty good, but it's a very large pokéball with a diameter of 96. I realized that three pokéballs side-by-side would probably work better, giving a width of 150 and a height of 50:
I then recreated as a 150x50 pixel image so that I could convert it to text and use it as a guide for my source code layout:
The result ends up stretched vertically, since most fonts are taller than they are wide, but I still think it looks good! Maybe next time I'll try to take that into account.
Conclusion
I had a lot of fun making pokegb, and I hope you enjoyed reading about it! If I left anything out or made any mistakes, feel free to send me a message on twitter @binjimint. Thanks to everyone on the emudev discord and the gbdev discord for making such a great community. Cheers!
via GitHub Pages | Websites for you and your projects, hosted directly from your GitHub repository. Just edit, push, and your changes are live.
June 6, 2021 at 01:00PM
The text was updated successfully, but these errors were encountered:
POKEGB: a gameboy emulator that only plays Pokémon blue
https://ift.tt/3uPaBjH
POKEGB: A gameboy emulator that only plays Pokémon Blue
I recently tweeted out an emulator that I made over the last few days:
In the video, I show some source code that looks like 3 pokéballs, compile it with gcc, and run it to play Pokémon Blue. The game is controllable, but has no sound. It renders the graphics w/ 12 colors (shades of red and blue). I choose BLUE as the name for our hero, and JERK as the name of his rival. The video ends when Professor Oak stops me from walking into the tall grass.
A lot of folks asked me to do a write-up of how this all works, so let's dive in!
(TL;DR: this is a very long writeup, check out the unobfuscated code instead if you prefer!)
Stats
But first, let's talk a little about some statistics. The final version in that tweet is actually 68 lines of code (I tweeted the wrong number!), each line less than 150 characters long, for a total of 9956 bytes. If you ignore whitespace and comments, it comes to 4720 bytes. That's too large for the International Obfuscated C Code Contest, but pretty close.
Prior to formatting the source code to look like pokéballs, the source was 188 lines and a total of 7786 bytes, with 5954 bytes of non-whitespace source.
Overview
It'll be useful to know a little about how the gameboy works. This won't go into all the details (for that you can read the pandocs), but will be enough to understand how this code works. Many values are easier to represent in hexadecimal, so I'll write those with a leading
$
, e.g.$FE00
.CPU
The gameboy CPU is a bit weird -- it's kind of like an Intel 8080 and kind of like a Zilog Z80, but not the same as either. It has an 8-bit accumulator
A
, and 3 16-bit register pairs,BC
,DE
, andHL
, which can be individually accessed as the 8-bit registersB
,C
,D
,E
,H
andL
. It also has a 16-bit stack pointerSP
, a 16-bit program counterPC
.The cpu also has 4 flag bits:
Z
,N
,H
, andC
:When accessing the AF register (which can only be done via push/pop instructions), the flags are stored in the most-significant bits of F:
The gameboy instruction set has ~500 instructions, each up to 3 bytes long. For example, the
NOP
instruction is just one byte (00
), but theJP
instruction is 3 bytes:$C3
followed by a 16-bit address$50 $01
.The instructions can be grouped as follows (taken from peach.bot's excellent document on the Emulation Development discord). I've included all instructions, but have struck through the ones that are not needed for Pokémon Blue. Feel free to skip over this on first read, I'll go into more detail later:
LD (u16), SPSTOPADD SP, i8LD A, ($FF00 + C)RST (call to 00EXP000)You can also use the
CB
prefix to access a different set of instructions:r8
is encoded as:r16
is encoded as:condition
is encoded as:opcode
is encoded as:RRAAs you can see, there are many instructions but we can group them into ~40 different categories, 5 of which aren't used by Pokémon Blue (as far as I know). (Turns out I was wrong,
RRA
is used; see the TryingToLearn function in the pokered source).One additional complication is how long each of these instructions take to execute. This is measured in "T-states", which run at the frequency of the CPU, ~4.19MHz. All instructions operate on a multiple of 4 T-states, which is often called machine-cycles or "M-cycles".
It's important to keep track of how many M-cycles each instruction takes, since other hardware components (such as the graphics rendering and audio) take time too, and we want to make sure that the CPU and these components stay in sync. Fortunately, the M-cycle counts for most instructions is relatively simple — every memory access costs 1 M-cycle (including the cost of reading the instruction). Some instructions take additional M-cycles, but we can handle those cases explicitly.
Memory
The gameboy CPU has a 16-bit address space, divided up into regions with different purposes. Here's a simplified layout:
Memory Bank Controllers
Gameboy cartridges often also use a Memory Bank Controller (MBC), which allows swapping out regions of ROM and RAM at runtime, as well as other features. Without this it would be impossible to access more than 32768 bytes of ROM.
Pokémon Blue uses MBC3, which has the ability to swap 64 different 16KiB banks of ROM (for a total of 1MiB), as well as 4 different of 8KiB banks of External Ram (for a total of 32KiB). The Pokémon Blue cartridge also has support for a real-time clock, but I didn't implement support for that.
The way you communicate with the MBC is by writing to various address regions in ROM (i.e. in the range
$0000-$7FFF
). See the following table for details. The features that are unimplemented in pokegb have been struck through:External RAM and timer enable(or RTC register select)Latch clock dataReading and writing to external RAM can be enabled and disabled. I believe this was used because the CPU may write garbage data to memory when losing power (e.g. when the gameboy is switched off). We don't care about that, so we'll skip implementing it. Fortunately, as far as I can tell, the game doesn't ever read from or write to this region when it is disabled.
As mentioned before, the cartridge is 1MiB, but the CPU address space is only 64KiB. So to access the rest of the cartridge data we have to remap ROM. Only ROM bank 1 can be remapped; ROM bank 0 is always mapped to the cartridge's first 16KiB chunk.
The mapped memory region in ROM bank 1 can be changed by writing a value between 0 and 63 to any address in the region
$2000-$3FFF
. So for example, if we write the value 5 to address$2000
, then the data mapped to addresses$4000-$7FFF
will reference the sixth 16KiB ROM bank of the cartridge (which are the bytes in range$14000-$17FFF
). The one exception is that if bank 0 is selected, then the cartridge's second ROM bank is selected instead of the first.Similarly, the external RAM address space is only 8KiB, but the available RAM is 32KiB. Writing a value between 0 and 3 to
$4000-$5FFF
will select a RAM bank.Memory-mapped I/O
The gameboy has 80 bytes of memory-mapped I/O, from address
$FF00
to$FF7F
. These addresses are sometimes called registers, not to be confused with the CPU registers. Not all of these addresses are used by the gameboy, and not all of the valid addresses are used by Pokémon Blue, and not all of the addresses used by Pokémon Blue are implemented in pokegb!Here is a list of the registers used by the original gameboy, with the unimplemented ones struck through:
Serial communicationTimer counter / modulo / controlSoundLCD statusY line compareUnfortunately, sound is completely skipped in pokegb. It would be fun to have, but would require a lot more code, I think.
Several of the registers are used as full 8-bit values (
DIV
,SCY
,SCX
,LY
,WY
,WX
). Others give meaning to individual bits:Object sizeBG/Window enableJoypadSerialTimerLCD STATJoypadSerialTimerLCD STATAs can be seen from the table, reading from the joypad gives different values depending on whether bits 4 or 5 of
JOYP
are 0. Typically the program will write to JOYP to read the directions and buttons separately, then later combine them into one byte. Also, the buttons are labeled with a leading ~ because they are inverted; e.g. the up direction is pressed when bit 2 is 0.LCDC
has a few bits to enable/disable the various rendering layers: background, window and objects (i.e. sprites). It also has some bits to choose which region of video RAM is used for map and tile data:The
IF
andIE
request and enable various interrupt sources, respectively. If a given interrupt is requested and enabled, and the Interrupt Master Enable flag (IME) is enabled, then an interrupt is fired. This will cause the CPU to stop normal execution and jump to an address for the "interrupt handler", the code that will respond to the interrupt, then return control back to the main program. Of all of the possible sources, pokegb only supports the vblank interrupt. The timer and serial interrupts are also used by Pokémon Blue, but are not implemented.Finally, the
BGP
,OBP0
, andOBP1
registers pick a palette color for the background and sprites. When rendering graphics, the gameboy can pick from 4 color values, each of which is an index into the appropriate palette. The palette then chooses what those color values mean:For pokegb, I decided to use a more fun palette with different colors for BGP, OBP0, and OBP1.
Graphics
The gameboy has a 160x144 pixel LCD screen. The graphics hardware (also known as a Picture Processing Unit or PPU) has the ability to draw one 32x32 scrollable background layer, one 32x32 "window" layer, and up to 40 8x8- or 8x16-pixel sprites. It has 8192 bytes of video RAM to stores the tile and map data. Each tile stores the pixel colors for an 8x8 block. The gameboy only has 4 colors, so each pixel's color can be represented with only 2 bits, or 4 pixels per byte. As a result, each tile is 16 bytes. The map is an array of 32x32 indexes into the tile data, one byte per tile. So the 32x32 background and window maps each use 1024 bytes.
It turns out that Pokémon Blue uses the background, window and sprite layers, but only uses 8x8-pixel sprites.
As shown above, the address of the BG and window maps can be chosen to be either
$9800-$9BFF
or$9C00-$9FFF
. Similarly the BG/Window tile data region can be chosen to be either$8000-$8FFF
or$8800-$97FF
. Interestingly, if$8800-$97FF
is selected the tile indexes start at$9000
and wrap around. So, for example, tile index 127 is at address$97F0
and tile index 255 is at address$8FF0
.The way the gameboy actually renders the screen is fairly complex (see Pixel FIFO in pandocs), but we can cheat. Most games (including Pokémon Blue) don't change any graphics settings while a scanline is drawing. So we can draw each scanline as it finishes.
Each scanline takes 456 T-states (or 114 M-Cycles) to render. The current T-state for this scanline is sometimes called a "dot", which has a value in the range
0-455
. There are a total of 144 scanlines, and 10 lines of "vertical blank" for a total of 154 lines per screen. So the amount of time it takes to render a full screen is 70224 T-states (or 17556 M-cycles).The vertical blank (or vblank) scanlines are a holdover from CRT screens, which needed this period to reset the electron gun to the upper-left corner of the screen. This happens to be a useful feature for the gameboy too, even though it uses an LCD screen instead, since it is a period when the PPU is not accessing video RAM or other video hardware. This gives the software an opportunity to update video RAM without affecting the on-screen visuals.
Window
The window layer is a region displayed over the background layer. It can be enabled or disabled, and positioned with the
WX
andWY
registers. When it is enabled, it always starts at the location(WX, WY - 7)
and covers the background in a rectangular region toward the lower-right of the screen. It is not scrolled by theSCX
andSCY
registers.The window is used in Pokémon Blue to display the title credits, various animations, menus, etc.
Sprites
The gameboy hardware supports up to 40 sprites per frame, and up to 10 sprites per scanline. Each sprite is defined in the OAM and uses 4 bytes. The bytes are laid out as follows:
The x- and y-coordinates of the sprite is offset by 8 and 16 pixels, respectively. This is done so a sprite can be displayed off the left or top edge of the screen. The sprite will be displayed if the x-coordinate is in the range 0-167 and the y-coordinate is in the range 0-161, inclusive.
Unlike the BG/Window, the sprite tile data is always at
$8000-$8FFF
, but otherwise the tile data is displayed the same way.The priority bit specifies whether this sprite is displayed in front of (0) or behind (1) the background/window. You can see this effect in Pokémon Blue when your player walks in the tall grass. It works because the Background color 0 is considered to be transparent, so the sprite will display through.
The x-flip and y-flip bits will flip the displayed sprite horizontally, vertically, or both, but otherwise will not affect how the sprite is displayed.
The palette bit specifies whether to use
OBP0
orOBP1
when displaying the sprite.Tiles
The background layer, window layer and sprite layer are all made up of 8x8 tiles. Each tile is 16 bytes, using 2 bits-per-pixel. For each pixel, there is a low bit and a high bit. The low bit and high bits are stored in separate bytes, where each byte stores one row:
So for example, if we wanted to draw the pokéball that Ash is tossing on the title screen:
Then we would separate the tile into two bit planes, where the low bit plane only has the values 0 or 1, and the high bit plane only has the values 0 or 2. If we add them together for each pixel, we'll get the image above:
We then interleave these bit planes as shown above to get the 16 bytes for this tile:
Making Pokegb
With that out of the way, I'd like to talk a little about how I made pokegb. I started working on Monday, May 24th. I've already written a gameboy emulator (binjgb), so I knew some of what I'd have to do. But I didn't know a lot about Pokémon Blue's code. The gameboy actually has a lot of features, and I didn't know how many Pokémon Blue used. I was hoping that I could avoid implementing a lot of them to keep the code small.
I decided to start by loading the game in binjgb and tracing the instructions. The trace output looked like this:
This prints out the CPU state after each instruction, along with a disassembly.
When I started pokegb, I spent most of the first few days implementing the CPU. Every time I hit an instruction that wasn't implemented, I'd implement it and see if it would run any further. I made pokegb write a trace output similar to binjgb, so I could see if they were diverging. Pokegb was significantly less accurate so unfortunately I couldn't just diff the two trace outputs. Instead I had to spot check them to see if the output looked way off.
After a few days, the CPU seemed to be in good shape. The game ran for millions of instructions without failing, and matched binjgb pretty closely. At that point I started working on rendering the graphics, and added some input. To my surprise, everything seemed to come together pretty quickly at that point! I only had a few more instructions to implement, and I was able to walk around, have conversations, and have battles. I'll admit that I haven't played very far into the game, so there are very likely bugs!
Code
Let's dive into the obfuscated code. I've made an unobfuscated version, where I've given the variables better names and formatted it using clang-format, but have otherwise not changed it.
Let's start at the top.
Opcode Macros
As mentioned above, most instructions can be grouped into one of ~40 categories. To make it easier to group them, I made 10 different macros:
OP4_NX8
,OP4_NX16_REL
,OP5_FLAG
,OP8_NX8_REL
,OP7_PTR
,OP7_NX8
,OP7_NX8_PTR
,OP49_REL
,OP56_PTR_REL
,OP9_IMM_PTR
.The naming convention I've used is as follows:
OPn
: This macro definesn
instructions.NXm
: Each instruction is spaced out by multiples ofm
.PTR
: This macro also sets theptr8
variable, which points to the address of the 8-bit source register.REL
: This macro also sets theopcode_rel
variable, which is the offset of the given opcode in its group (e.g.OP4_NX16_REL
will haveopcode_rel
values of 0, 16, 32, or 48). This can be helpful for indexing into a register array.IMM
: This macro also defines the "immediate" instruction e.g,ADD A, u8
.FLAG
: This macro defines an instruction that uses thecondition
codes as well as its unconditional version, e.g.CALL NZ, u16
andCALL u16
.Let's take a look at some of the more interesting macros:
First up is
OP5_FLAG
. This macro defines 4 conditional instructions, as well as the unconditional version. Here the variablecarry
is re-purposed to determine whether the operation should be performed:The
F_mask
andF_equals
arrays are defined to encode thecondition
table above, andF
is the flag register:For example, the condition
NZ
means to execute the instruction if theZ
flag is not set. TheZ
flag is bit 7 of the flags register, or 128 in decimal. So we can tell ifZ
is not set by calculating(F & 128) == 0
.Next is
OP9_IMM_PTR
. This macro defines 9 instructions, one for each entry in ther8
table above, as well as for the immediate instruction (which is always at an offset of 70). Each of these instructions operate on an 8-bit operand: one of the 7 registers (A
,B
,C
,D
,E
,H
, orL
), the byte at addressHL
, or the immediate valueu8
.The
read8_pc()
function reads the next byte atPC
(i.e. the next byte in the instruction stream). Theread8()
function can read a byte at any address, but defaults to reading the byte atHL
. And as mentioned above,ptr8
points to one of the 8-bit registers.Registers
It's convenient to be able to access most registers both as their 16-bit pairs or their 8-bit parts. So all CPU registers are defined in an array, along with their initial values.
These are defined in the order
C
,B
,E
,D
,L
,H
,SP High
,SP Low
. This allows us to read a 16-bit register pair directly as long as the host machine is little-endian. Note that SP is never accessed as an 8-bit register, so it doesn't need to be defined here; this may be a place where I could save some additional bytes.The
A
andF
registers are used for many of the instructions directly, so we create references to them by name.The
r8
group described above uses the orderB, C, D, E, F, H, L, (HL), A
. Thereg8_group
array matches this, but uses&F
instead. This is never actually used, so it could have been 0 instead to save a byte.There are three
r16
groups above, and each are represented here, withreg16
beingr16 (group 3)
. Group 2 is actually meant to beBC, DE, HL+, HL-
, whereHL+
andHL-
use the current value ofHL
and then increment or decrement, respectively. That behavior is not included here, but instead used directly in the instructions' implementations via theHL_add
array.As with
A
andF
,HL
andSP
are often used directly, so we've created references to them by name.Finally, there is a helper function for setting CPU flags. The
mask
parameter masks out any flags that are supposed to be set by this instruction, then theZ
,N
,H
,C
parameters will set it to their new value. As a size optimization, theZ
flag is negated here. That's because many instructions set the Z flag if the instruction produces a 0 result, so it's convenient to do the logical not here rather than all places that callset_flags()
.Memory
All the memory for the system is stored in static arrays or memory-mapped regions.
rom0
: a pointer to the first 16KiB bank of ROM (addresses$0000-$3FFF
). This is memory-mapped at the beginning of the program.rom1
: a pointer to the currently mapped bank of ROM1 (addresses$4000-$7FFF
).extram
: a pointer to the entire 32KiB of External RAM. This is memory-mapped at the beginning of the program.extrambank
: A pointer to the currently mapped 8KiB of External RAM.io
: the 512 bytes from addresses$FE00-$FFFF
; this includes OAM, I/O, and high RAM.video_ram
: 8KiB bytes of video RAM.work_ram
: 16KiB of work RAM, used by the game for any purpose.The registers
IF
,LCDC
,LY
, andDIV
are used enough times that it saves source code to reference them by name, rather than via theio
array:The
frame_buffer
is a160 * 144 = 23040
buffer, with anint
for each pixel. I assume that anint
is 4-bytes, so it can store a 32-bit RGBA value.The palette values look strange, but they are stored as signed-integers to save in source code size. When written as hexadecimal values, they look more natural. (Note that these values are stored so the red channel is the least-significant byte and the alpha channel is the most-significant byte):
0xffffffff
0xffffa563
0xffff0000
0xff000000
0xffffffff
0xff8484ff
0xff3a3a94
0xff000000
0xffffffff
0xffffa563
0xffff0000
0xff000000
Memory Access
The CPU can only read or write one byte at a time. The
mem_access()
function implements both reading and writing to save on source code size. Ifwrite == 0
thenval
is ignored. Ifwrite != 0
, then the return value is ignored.First, note that all memory accesses call
tick()
. This increments the cycle counter by 4, which is the number of T-states required to do an 8-bit memory access.As a size-optimization, both
read8()
andwrite8()
takeHL
as the default address. This requireswrite8()
to be written "backward", withval
as the first parameter andaddr
as the second.In
mem_access()
, the address is shifted right by 12. This extracts the top nibble of the address, which can then be used to determine which region of memory was accessed.For cases 2 and 3, the address being accessed is in the range
$2000-$3FFF
. If this region is written to, it will remap ROM bank 1 (see the Memory Bank Controllers section above). The bank number is shifted left by 14, since each bank is 1 << 14 = 16KiB.Either way, we fall through to cases 0 and 1 below. For 0 and 1, the address being accessed is in the range
$0000-$1FFF
, which is in ROM bank 0.For cases 4 and 5, the address being accessed is in the range
$4000-5FFF
. If the region is written to, it will remap the External RAM bank. Either way, we fall through to cases 6 and 7 below, which covers the ROM bank 1 region.Cases 8 and 9 are the video RAM region, addresses
$8000-$9FFF
. On the real gameboy the video RAM cannot be updated during certain times (e.g. when the PPU is accessing it), but we assume the game is well-behaved and allow it at all times.Cases 10 and 11 are in the external RAM region, addresses
$A000-$BFFF
. On real gameboy there are ways to disable read and write access to this region, but again we assume that the game is well-behaved and allow it at all times.Case 15 is the most complex. Here we first test if the address is greater than or equal to 65024 (
$FE00
). If so, then we know that we're accessing somewhere in theio
region, which includes OAM, I/O, and High RAM.If the address is a write to 65350 (
$FF46
), then the game is requesting a DMA transfer to OAM. This is an extremely common way for games to copy sprite data into OAM, since it is much faster than having the CPU do it manually. The value written to$FF46
is used as the high byte of the source address. So, for example, if the value is$C1
, then 160 bytes will be copied from$C100-$C19F
to OAM.If the address read from is 65280 (
$FF00
), then the game is reading theJOYP
register which gives the current joypad state. The value returned depends on what was previously written toJOYP
, either the directions buttons if bit 4 ofJOYP
is 0, or the face buttons if bit 5 ofJOYP
is 0. If neither is 0, thenJOYP
returns 255 ($FF
).Finally, if the memory access address is less than 65024, then it must be an access to work RAM, either the primary address range at
$C000-$DFFF
, or the mirror range at$E000-$FDFF
. In either case the address is masked and a write or read is performed.A few more functions are used to reduce code size. First
read8_pc()
(mentioned above), which reads a byte from the instruction stream.Next
read16()
, which reads a 16-bit value from any address, but defaults toPC
. The gameboy CPU is little-endian, so this writes the least-significant byte first. This function is also used to implement thePOP
instruction. Note thataddr
must be incremented in separate statements to avoid undefined behavior.Then
push()
, which pushes a 16-bit value onto the stack. The most-significant byte is written first, because the value is written asSP
is decremented. Also note that thetick()
is called becausePUSH
always takes an additional M-cycle.Initialization
The
main()
function begins by initializing the ROM and External RAM areas:Since we know that Pokémon Blue is 1MiB, we can directly
mmap
it. This saves some space overfopen
/fread
as long as we use numbers instead ofPROT_READ
, etc.Similarly, we can
mmap
the External RAM file, though this is even more convenient since any writes to this region will automatically be written to the save file.One slightly annoying quirk is that the file must be truncated to the correct length, or else writes will crash the emulator. This behavior seems to work on Linux, but fails on macOS (see pokegb issue 1):
The
LCDC
andDIV
I/O registers are set to initial values to match binjgb. It's likely that this isn't actually required, and could be removed to save some bytes.Next we initialize SDL:
Then create a window at 5x the gameboy resolution, along with a renderer that can draw to it:
And create a texture at the gameboy resolution that we will lock and draw to later:
And finally grab the keyboard state array. Conveniently, this will be automatically updated by SDL whenever we poll for events:
CPU
After initialization, we start a loop that steps the CPU and updates the graphics. There are a few different states that the CPU can be in: running normally, "halted" waiting for an interrupt, and handling an interrupt.
That is all handled at the start of the loop:
First, the the current cycle counter is saved so we know how many cycles to update the PPU later. Then we check if any bit is set in all of Interrupt Master Enable flag (
IME
), Interrupt Request (IF
), and Interrupt Enable (io[511]
). Since pokegb only supports the vblank interrupt, then we know that we only need to check for bit 0, but it's smaller to check for any bit here.If an interrupt occurs, the CPU is no longer halting, and the
IME
andIF
flags are cleared. The CPU then performs the equivalent of aCALL
to address$0040
(64) which is the vblank interrupt handler.If the CPU is halted, then we simply tick the clock and loop until an interrupt occurs.
Otherwise, we read the next byte in the instruction stream and try to execute it.
Instructions
This is the real meat of the emulator. As we saw above, there are a lot of instructions, and most of them are needed to run Pokémon Blue. Let's start with the simplest instruction,
NOP
, which does nothing:Off to a good start! Next is
LD r16, u16
, which copies a 16-bit value from the instruction stream into a 16-bit register pair. This handles opcodes 1, 17, 33, and 49. Since these opcodes are spaced by multiples of 16, we can shift right by 4 to index intor16 (group 1)
:Next up,
LD (r16), A
, which writesA
into memory at a given address. This also can optionally increment or decrementHL
. Notice that we useopcode_rel
for this; it's not required in this case sinceopcode / 16
is equal toopcode_rel / 16
. But for other instructions it will be useful:Then
INC r16
andDEC r16
, which increment or decrement a register pair. These take an additional M-cycle.Likewise there are
INC r8
andDEC r8
which increment or decrement an 8-bit register. Unlike their 16-bit equivalents, these update CPU flags.goto
is used to share some code:The
r8
group also includes(HL)
, which means to use the byte at the addressHL
instead. It is simpler to handle these separately, usingread8()
andwrite8()
.These instructions also have the code to handle setting flags. Remember that the first parameter is the mask. 16 is the value of the carry flag, so this mask means that the carry flag will be preserved. The other flags are set according to the rest of the arguments:
Z
is set if the result is zero,N
is set if the operation was a decrement, andH
is set if there was an overflow to bit 4 of the result.Next is
LD r8, u8
, which loads a value from the instruction stream into a register or(HL)
.Next,
RLCA
andRRCA
, which rotate theA
register left or right by one bit. The bit that is rotated out is copied to the carry flag, as well as moved to the other end.A += A
is used instead ofA << 1
, andA / 2
is used instead ofA >> 1
to reduce code size.RLA
is similar toRLCA
, but it rotates theA
register "through" the carry flag, meaning that it is a 9-bit rotate. Bit 7 ofA
is moved into the carry flag, and the carry flag is moved into bit 0 ofA
.When the flags are set, the mask is 0, which means that all flags are set:
Z
is always 1,N
andH
are 0, andC
is set tocarry
:The expression
F / 16 % 2
looks odd, but extracts the carry bit fromF
into bit 0.Now onto the
JR
instructions. They branch thePC
to a new location, relative to where the currentPC
is, either forward or backward. That's whytmp8
is cast to anint8_t
. Thecarry
variable is used as a condition, which is automatically set by theOP5_FLAG
macro. Notice also that this uses the comma operator to save a byte of source code.Next is a beast of an instruction, Decimal Adjust for Addition (
DAA
). It's intended to be used for handling arithmetic on Binary Coded Decimal (BCD) values.The basic idea is relatively simple: a two-digit decimal number can be stored in a byte where each nibble stores one digit. For example, the decimal number 42 can be stored as the hexadecimal number
$42
(66). This is useful for older harware that didn't have a divide instruction, like the gameboy.DAA
is used after addition or subtraction to fix the BCD value. So if we take the decimal numbers 42 and add 9, we expect to get 51. But if we do this with BCD values, we'll get$4B
instead. ExecutingDAA
after this addition will adjust$4B
to$51
as expected.The first
if
determines whether we need to adjust the least-significant nibble, and the secondif
determins whether we need to adjust the most-significant nibble. The adjustment value added to (or subtracted from)A
is either$00
,$06
,$60
, or$66
, depending on the values of theH
,C
, andN
flags, and the current value of theA
register.For our example,
$06
will be added to$4B
to produce$51
.Now a simpler one,
CPL
. This instruction performs the ones' complement ofA
, which means flipping all of its bits.SCF
andCCF
set or complement the carry flag. The implementation combines them to save on space.The
LD r8, (HL)
andLD (HL), r8
instructions read from or write to memory. In this case there is an exception to the pattern; normally ther8
group includes(HL)
, but there is noLD (HL), (HL)
instruction. That instruction is insteadHALT
.Speaking of which, here's
HALT
. On the real gameboy, there are some complexities with howHALT
behaves, but here we merely set thehalt
variable to 1.The
LD r8, r8
group covers 49 different instructions, not including(HL)
. It has several seemingly redundant instructions, such asLD B, B
andLD C, C
. Some emulators use these instructions as a breakpoint instead, since they otherwise have no purpose:Here's one of the trickier set of instructions. There are 4 different operations handled here: addition (
ADD
), addition with carry (ADC
), subtraction (SUB
), and subtraction with carry (SBC
). In other words, each instruction does the following:ADD
:A = A + operand + 0
ADC
:A = A + operand + carry
SUB
:A = A - operand - 0
SBC
:A = A - operand - carry
They all can be implemented using the same code. In regular mathematics, we can rewrite subtraction as addition (e.g.
3 - 5 = 3 + -5
). We can do the same for two's complement arithmetic, but instead of negating the operand, we can complement it (~
) and add one, which is equivalent:ADD
:A = A + operand + 0
ADC
:A = A + operand + carry
SUB
:A = A + ~operand + 1 - 0
SBC
:A = A + ~operand + 1 - carry
But since
carry
is either 0 or 1, we can simplify this to:ADD
:A = A + operand + 0
ADC
:A = A + operand + carry
SUB
:A = A + ~operand + 1
SBC
:A = A + ~operand + ~carry
The only additional complication is how to deal with the flags. The
N
flag is simplest, it is 0 for addition and 1 for subtraction. TheH
flag is set if there is an overflow into bit 4 of the result. It turns out that we can correctly calculate this by seeing if the sum of the low nibbles ofA
,operand
, andcarry
is greater than 15. For subtraction, we take this result and flip it, i.e. we only set theH
flag if the sum is not greater than 15. Similarly, theC
flag checks for an overflow out of bit 7 of the result, so we check whether the sumA + operand + carry
is greater than 255.Next are the common bitwise operations,
AND
,OR
, andXOR
. They are relatively straightforward since c++ has built-in operators for them:Next up is the collection of
CP
instructions. They have the same behavior asSUB
, except they don't modifyA
, only the CPU flags:Now we get to the set of
RET
instructions. They return from a function call by popping the return address off of the stack. Like theJR
instructions above, they come in "conditional" and "unconditional" flavors. TheOP5_FLAG
macro setscarry
to the result of the condition. But there's alsoRETI
, which is the same asRET
followed byEI
(enable interrupts).Then onto
PUSH
andPOP
instructions. These write to or read from the stack, and then move the stack pointer accordingly. They don't have to be used for stack access, however. Pokémon Blue actually uses them to decode some graphics during vblank, sincePOP
is a fast way to read 16 bits into a register pair.The
JP
instructions is similar to theJR
instructions, except that the jump is absolute instead of relative.The
CALL
instructions are similar to theJP
instructions, but the also push the oldPC
to the stack first:The gameboy also has some instructions to make it easier to access the top 256 bytes of the address space (
$FF00-$FFFF
). The byte offset is either read from the instruction stream as($FF00 + u8)
or from theC
register as($FF00 + C)
.The
LD A, (u16)
instruction is also implemented here, which reads an address from memory, then reads that byte into theA
register. Its implementation is combined withLD A, ($FF00 + u8)
to save space.The counterpart to
LD A, (u16)
isLD (u16), A
, which reads an address from the instruction stream, then writes the value in theA
register to that address. This probably could have been combined with case 224 and 226 above to save space.All of the other jump instructions we've seen so far take an immediate offset or address, but
JP HL
is used when you want to jump to an arbitrary address, for example, to implement a jump table.The Interrupt Master Enable (
IME
) flag can be modified with theDI
andEI
instructions, either disabling or enabling the flag.Finally, there are two instructions that copy between
HP
andSP
. TheLD HL, SP + i8
adds a signed value toSP
and stores it in theHL
register. On the other hand,LD SP, HL
just copies theHL
register directly intoSP
.$CB Instructions
The
$CB
instruction is an instruction prefix. All$CB
instructions are two bytes, and the second byte determines which operation it is (see opcode group 3 above).First off are
RLC
andRRC
, which are similar toRLCA
andRRCA
above but can be used for any register (as well asHL
) Interestingly, theRRC (HL)
instruction is also used by Pokémon Blue, butRLC (HL)
doesn't seem to be. It works the same asRRC r8
, but operates on the memory at(HL)
:Next are
RL
andRR
. LikeRLA
andRRA
above, these rotate "through" the carry flag for a 9-bit rotate, but they can be used for any register.Now Shift Left Arithmetic (
SLA
) and Shift Right Arithmetic (SRA
). "Arithmetic" means thatSRA
behaves like a signed divide by 2. In two's complement, that means that the most-significant bit (i.e. the sign bit) is not changed. For example, performingSRA
on$43
(67 in decimal,01000011
in binary) results in$21
(33 in decimal,00100001
in binary). On the other hand, performingSRA
on$F0
(-16 in decimal,11110000
in binary) results in$F8
(-8 in decimal,11111000
in binary).On the other hand, Shift Right Logical (
SRL
) fills 0 into the most-significant bit after a shift. So for example, performingSRL
on$F0
(240 in decimal,11110000
in binary) results in$78
(120 in decimal,01111000
in binary).The
SWAP
instructions swap the low and high nibble of the byte. For example,SWAP
on$A3
results in$3A
.The
BIT
instructions check whether a given bit is set, and set theZ
flag accordingly. The bit to check is encoded in the opcode itself, so we can useopcode_rel / 8
to extract it:The
RES
instructions reset (i.e. clear) a given bit in an 8-bit register or(HL)
. Again, the bit to clear is encoded in the opcode.And finally, the
SET
instructions set a given bit in an 8-bit register or(HL)
.With that, we've completed the CPU implementation, in 1955 bytes (after removing all unnecessary whitespace).
Graphics
Now that we've executed the CPU, we want the PPU to catch up. The current CPU cycle count is in
cycles
, and the cycle count before executing the instruction is inprev_cycles
. So we can update the PPU bycycles - prev_cycles
T-states.The
DIV
I/O register counts the number of T-states as well, but only stores the high byte of this value at address$FF04
. Fortunately$FF03
is left unused, so we can treatDIV
as a 16-bit value starting at$FF03
instead, and increment it directly:First off, we only want to draw the screen if the LCD is enabled (see
LCDC
bit 7). So we check for that, and otherwise reset the current PPU dot and the current scanline counter to zero:Next we want to trigger a vblank interrupt when the screen is finished rendering. This happens after the 143rd line finishes, so we check whether the current "dot" was 0, and whether the current scanline is 144, and set the interrupt request to 1. This could have been just
IF=1
, since we don't support any other interrupts:After that is the fun bit, actually drawing a scanline! Since we're keeping things simple, we only render a scanline when it finishes, at dot 456. It is possible on the real gameboy to modify PPU settings during the scanline, but those don't happen in Pokémon Blue.
After rendering the line, we increment
LY
and resetppu_dot
back to 0.If the current scanline is less than 144, we draw it. This includes rendering the background and window layers, and any sprites that are on this line. We loop over every x-coordinate as
tmp
:First we want to determine if we're drawing the window layer or the background layer. Like everything else, the real logic to determine this is somewhat complex, but we can do a simpler version. If the window is enabled, and the current pixel we're drawing
(tmp, LY)
is greater than the upper-left coordinate of the window(WX - 7, WY)
, then we are drawing the window. Recall from above thatWY
isio[330]
andWX
isio[331]
:Now we want to find out the pixel offset of the current pixel, either into the background map or the window map. If this pixel is displaying the background, then we offset by the background scroll registers
(SCX, SCY)
. If instead this pixel is displaying the window, then we offset by the window's upper left corner(WX - 7, WY)
. Again, note thatSCX
isio[323]
andSCY
isio[322]
:Now that we know the pixel offset, we can determine which tile is being drawn. The background and window maps start at either address
$9800
or$9C00
, depending on the settings inLCDC
. Since we're reading from video RAM directly (rather than through theread8()
function), we want to use the offset into video RAM instead,$1800
or$1C00
. This can be rewritten as$6 << 10
and$7 << 10
.The window map is at
$1C00
ifLCDC
bit 6 is set (== 64
), and the background map is at$1C00
ifLCDC
bit 3 is set (== 8
). So we can determine the base map address by checking the correct bit, depending onis_window
.Finally, the tile offset within the map can be found by taking the pixel offset and dividing by 8, since there are 8 pixels per tile in both dimensions. The map is 32x32 tiles, and each tile is one byte, so multiplying the y tile offset by 32 gives the final expression.
We also set the
palette_index
to 0 by default (which means to useBGP
).Now that we know which tile is being rendered, we have to read the pixel data for that tile. The starting address of the tile data is either
$8000
or$8800
. But as mentioned above, if the starting address is$8800
, then tile 0 is actually at address$9000
, and then the tile indexes wrap around back to$8800
at tile index 128.LCDC
bit 4 (== 16
) determines this behavior.We can get the pixel row of the tile to draw with the expression
y_offset % 8
. Each row is 2` bytes (since it has two bitplanes) and each tile is 16 bytes:This gives us the low byte of this row in
tile_data[0]
and the high byte intile_data[1]
.x-offset & 7
will give us the offset of the current pixel in this tile's row. However, you may have noticed above that a tile's row has the leftmost pixel as the most-significant bit. So, for example, if we are at pixel offset 0, we want to use bit 7 of the row. This means that the amount to shift right is actually7 - x_offset
, which is the same asx_offset ^ 7
.Let's say we're drawing the fifth pixel of the second row of Ash's pokéball:
Then
x_offset
is 3, so when we shift right by 3, we get:Then we perform
% 2
:And then we shift the high byte left by 1, and add the bytes together:
This gives us a final color of 2.
After we've calculated the background or window pixel color, we may have to draw a sprite on top of it. Sprites can be enabled or disabled with bit 1 of
LCDC
:OAM has enough space for 40 sprites, each one requiring 4 bytes. On a real gameboy only 10 sprites can be drawn per scanline, but we'll ignore that limit. OAM starts at the beginning of
io
(address$FE00-$FE9F
).The sprite's coordinate is
(sprite[1] - 8, sprite[0] - 16)
. Similar to the background tile, we can find which row and column of the sprite tile by subtracting the current pixel's coordinate from the sprite's. The sprite's tile is drawn if the x- and y-offset are both less than 8:As with the background tile, we want to xor the x-offset with 7 so the most-significant bit is also the leftmost bit. However, a sprite can also be flipped horizontally if bit 5 of
sprite[3]
is set. In that case we can leave the x-offset as it is by xor'ing it with 0:Again, like background tiles, we want to fetch the tile data row for this sprite. The sprite's tile is in
sprite[2]
. The sprite can also be flipped vertically, so we xor the sprite's y-offset with 7 if bit 6 ofsprite[3]
is set.We can then extract the color of the sprite's pixel:
A sprite's pixel is only drawn if the sprite's pixel is non-transparent. But the background pixel also must be transparent or the sprite must have priority over the background. In other words, if the background pixel is non-transparent and the sprite doesn't have priority, then the sprite's pixel will not be drawn.
The sprite priority is bit 7 of
sprite[3]
; if it is 1, then the sprite is drawn behind the background.The sprite can use one of two palettes,
OBP0
andOBP1
. If bit 3 ofsprite[3]
is 1, then paletteOBP1
is used. We wantpalette_index
to be 1 forOBP0
and 2 forOBP1
, so we add one to the value.!!
is a little trick to convert a non-zero value to 1:Now that we know the
color
andpalette_index
, we can look it up in thepalette
to find the actualRGBA
color to draw. Each palette is a single byte, stored like this:io
offsetAdding 327 to
palette_index
gives us the offset inio
for the palette. We can then shift the palette byte to the right by2 * color
to isolate the palette entry we want.We'll write the final
RGBA
color into the frame buffer at the current pixel's x- and y-coordinate,(tmp, LY)
.Rendering with SDL
Once we've finished drawing all the lines, we can finally draw the framebuffer to the window! First we have to lock the texture that we created before, so we have access to its pixels. The
tmp
parameter stores the pitch of the texture, i.e. the width of a row of the texture in bytes, which may include extra padding.For each row of the frame buffer, we copy it into the texture. The total number of bytes copied per row is
160 * 4 == 640
, since anRGBA
pixel is 4 bytes:Then we unlock the texture, render it to the window's back buffer, and then present it to the screen:
As the very final step, we poll for events, checking to see if the user quit. This also updates
key_state
, so we can read from it directly.Formatting the Code
Before I sent the tweet, I wanted to make the code look cool. I remember looking at some examples from the IOCCC and thought I'd take a chance at formatting the code into a picture. See, for example, Don Yang's winning entry from the 2020 competition:
To start, I took the source code I had, and renamed all the variables to single letter names, where possible. I then removed all whitespace to see how many characters I was using, which was 4150 bytes, not including the
#include
and#define
lines.My first thought was to use the Pokemon logo itself. I tried to make a good 1 bit-per-pixel version, but it ended up looking like garbage when I tried to fit the source code to it:
My next thought was to use a single pokéball. I used the awesome Desmos online calculator to try to figure out how big each circle should be interactively:
This looks pretty good, but it's a very large pokéball with a diameter of 96. I realized that three pokéballs side-by-side would probably work better, giving a width of 150 and a height of 50:
I then recreated as a 150x50 pixel image so that I could convert it to text and use it as a guide for my source code layout:
The result ends up stretched vertically, since most fonts are taller than they are wide, but I still think it looks good! Maybe next time I'll try to take that into account.
Conclusion
I had a lot of fun making pokegb, and I hope you enjoyed reading about it! If I left anything out or made any mistakes, feel free to send me a message on twitter @binjimint. Thanks to everyone on the emudev discord and the gbdev discord for making such a great community. Cheers!
via GitHub Pages | Websites for you and your projects, hosted directly from your GitHub repository. Just edit, push, and your changes are live.
June 6, 2021 at 01:00PM
The text was updated successfully, but these errors were encountered: