18. IDE Hard Disk Driver

Starting from this chapter, we will design and implement the next big part of an OS kernel: the storage stack. A significant portion of an OS's input/output work goes to and from external storage devices, such as tapes, floppy disks, hard disk drives, and solid-state drives. Being able to store data on persistent media inside these devices and fetch it back across power-offs brings persistence into the system.

Support for persistence is crucial in any computer system. The OS code itself must be stored on a storage device and be loaded into main memory at boot. User program binaries are stored, waiting to be loaded and executed. Useful user programs (and the system itself) will likely need to save data or state across power-offs and failures. Without persistence, the system could only do transient things on volatile data.

We now start building layers of a typical storage stack in this and the next few chapters to enable persistence in Hux.

Main References of This Chapter

Scan through them before going forth:

I/O Devices chapter of the OSTEP book: concepts of I/O, what is a storage stack ✭
Hard Disk Drives (HDD) chapter of the OSTEP book: internals of a hard disk device
IDE disk driver implementation of xv6
ATA PIO Mode page: "Registers" + "IRQs" sections
ATA Command Matrix page

The Storage Stack

The collection of components that bridge read/write operations with storage device requests is often called the storage stack. People call it a software stack because it tends to be clearly layered (though not necessarily true in modern OSes), similar to the network stack.

A storage stack typically consists of the following layers, from top to bottom ✭:

I/O-related system calls, e.g., read(), write(), etc.
(often) Virtual file system (VFS) layer that provides a unified interface for file systems, e.g., vfs_open(), vfs_read(), etc.
Actual file system implementations, e.g., Ext2/3/4, XFS, Btrfs, FAT, NTFS, etc.
(often) A general block device interface layer
Block device drivers for various kinds of storage devices, on various types of connections, and with various types of hardware protocols
Physical storage devices, perhaps with in-device controllers

In Hux, we will implement a beginner-level solution at each layer bottom-up. Let's begin with a basic device driver for classic (Parallel) ATA hard disks (also referred to as IDE disks due to historical reasons).

For our small Hux kernel, the entire OS kernel is compiled into a single binary and packed with a mature bootloader, GRUB, into a CD-ROM image. GRUB itself implements basic I/O support for loading an OS kernel image, which means that we don't need to worry about a "chicken-or-egg" problem at booting. Our storage stack will assume that the OS has been fully loaded.

A modular kernel or microkernel faces a problem here: it needs to load modules from a disk, but the code that enables disk access (disk drivers, file system, etc.) lives within those modules. The solution is to include a small initialization ramdisk (initrd) in the kernel image to be used at early boot time, which provides a basic disk driver for loading modules. After the storage modules have been loaded, the kernel switches to the complete storage stack.

ATA Port I/O (PIO) Mode

To implement a device driver, understanding the device interface standards is a must. In the basic ATA port I/O (PIO) mode of an IDE hard disk, all commands and data are sent to and retrieved from the device through I/O ports with the inx/outx instructions.

This means the CPU is involved in the transfer of every byte of data, which is far from ideal. Modern systems generally use direct memory access (DMA) techniques instead.

IDE Registers Mapping

The hardware PCI chipset almost always supports at least two ATA buses, of which the first one is called the primary bus and the second one is called the secondary bus. On each bus, there can be at most two IDE hard disk drives, of which one is called the master drive (index 0) and the other is called the slave drive (index 1, very bad naming due to historical reasons). Each ATA bus has a mapping from a set of I/O port numbers to specific device registers (which are state/data registers inside the device). On each bus, only one of the two drives is selected at any time. Reading/writing on port 0x1F0 means receiving from/sending to the currently selected drive on the primary bus, etc.
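As a rough illustration of this mapping, the sketch below returns the conventional legacy port bases of the two buses. The primary values (0x1F0 and 0x3F6) are the ones used in this chapter; the secondary values (0x170 and 0x376) are the usual legacy defaults and are not used by Hux. The names `ata_bus_ports_t` and `ata_legacy_ports` are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/** Hypothetical helper type: port bases of one ATA bus. */
typedef struct {
    uint16_t io_base;    /** Data, error, sector count, LBA, select, status. */
    uint16_t ctrl_base;  /** Alternate status / device control. */
} ata_bus_ports_t;

/**
 * Return the conventional legacy port bases for bus 0 (primary) or
 * bus 1 (secondary). Not part of Hux, which hard-codes the primary bus.
 */
static inline ata_bus_ports_t
ata_legacy_ports(uint8_t bus)
{
    assert(bus <= 1);   /** Only two legacy buses are covered here. */

    ata_bus_ports_t ports;
    ports.io_base   = (bus == 0) ? 0x1F0 : 0x170;
    ports.ctrl_base = (bus == 0) ? 0x3F6 : 0x376;
    return ports;
}
```

Each register is then addressed as an offset from one of the two bases. For example, the status/command register sits at `io_base + 7`, which is 0x1F7 on the primary bus.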

In Hux, assume that we only have one IDE disk connected to the primary bus on index 0. Following the ATA PIO Mode page of the OSDev wiki (which has tons of information on getting IDE disks working), define a set of macros that describe the standardized mapping and format of primary bus device registers @ src/device/idedisk.h:

/** Hard disk sector size is 512 bytes. */
#define IDE_SECTOR_SIZE 512


/**
 * Default I/O ports that are mapped to device registers of the IDE disk
 * on the primary bus (with I/O base = 0x1F0).
 * See https://wiki.osdev.org/ATA_PIO_Mode#Registers.
 */
#define IDE_PORT_IO_BASE        0x1F0
#define IDE_PORT_RW_DATA        (IDE_PORT_IO_BASE + 0)
#define IDE_PORT_R_ERROR        (IDE_PORT_IO_BASE + 1)
#define IDE_PORT_W_FEATURES     (IDE_PORT_IO_BASE + 1)
#define IDE_PORT_RW_SECTORS     (IDE_PORT_IO_BASE + 2)
#define IDE_PORT_RW_LBA_LO      (IDE_PORT_IO_BASE + 3)
#define IDE_PORT_RW_LBA_MID     (IDE_PORT_IO_BASE + 4)
#define IDE_PORT_RW_LBA_HI      (IDE_PORT_IO_BASE + 5)
#define IDE_PORT_RW_SELECT      (IDE_PORT_IO_BASE + 6)
#define IDE_PORT_R_STATUS       (IDE_PORT_IO_BASE + 7)
#define IDE_PORT_W_COMMAND      (IDE_PORT_IO_BASE + 7)

#define IDE_PORT_CTRL_BASE      0x3F6
#define IDE_PORT_R_ALT_STATUS   (IDE_PORT_CTRL_BASE + 0)
#define IDE_PORT_W_CONTROL      (IDE_PORT_CTRL_BASE + 0)
#define IDE_PORT_R_DRIVE_ADDR   (IDE_PORT_CTRL_BASE + 1)


/**
 * IDE error register flags (from PORT_R_ERROR).
 * See https://wiki.osdev.org/ATA_PIO_Mode#Error_Register.
 */
#define IDE_ERROR_AMNF  (1 << 0)
#define IDE_ERROR_TKZNF (1 << 1)
#define IDE_ERROR_ABRT  (1 << 2)
#define IDE_ERROR_MCR   (1 << 3)
#define IDE_ERROR_IDNF  (1 << 4)
#define IDE_ERROR_MC    (1 << 5)
#define IDE_ERROR_UNC   (1 << 6)
#define IDE_ERROR_BBK   (1 << 7)


/**
 * IDE status register flags (from PORT_R_STATUS / PORT_R_ALT_STATUS).
 * See https://wiki.osdev.org/ATA_PIO_Mode#Status_Register_.28I.2FO_base_.2B_7.29.
 */
#define IDE_STATUS_ERR (1 << 0)
#define IDE_STATUS_DRQ (1 << 3)
#define IDE_STATUS_SRV (1 << 4)
#define IDE_STATUS_DF  (1 << 5)
#define IDE_STATUS_RDY (1 << 6)
#define IDE_STATUS_BSY (1 << 7)


/**
 * IDE command codes (to PORT_W_COMMAND).
 * See https://wiki.osdev.org/ATA_Command_Matrix.
 */
#define IDE_CMD_READ           0x20
#define IDE_CMD_WRITE          0x30
#define IDE_CMD_READ_MULTIPLE  0xC4
#define IDE_CMD_WRITE_MULTIPLE 0xC5
#define IDE_CMD_IDENTIFY       0xEC


/**
 * IDE drive/head register (PORT_RW_SELECT) value.
 * See https://wiki.osdev.org/ATA_PIO_Mode#Drive_.2F_Head_Register_.28I.2FO_base_.2B_6.29.
 */
#define IDE_SELECT_DRV (1 << 4)
#define IDE_SELECT_LBA (1 << 6)

static inline uint8_t
ide_select_entry(bool use_lba, uint8_t drive, uint32_t sector_no)
{
    uint8_t reg = 0xA0;
    if (use_lba)        /** Using LBA addressing mode. */
        reg |= IDE_SELECT_LBA;
    if (drive != 0)     /** Can only be 0 or 1 on one bus. */
        reg |= IDE_SELECT_DRV;
    reg |= (sector_no >> 24) & 0x0F;  /** LBA address, bits 24-27. */
    return reg;
}

Please refer to the wiki page for more information.
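To make the drive/head register format more concrete, here is a small worked example. It repeats the logic of `ide_select_entry()` so that it compiles on its own; the only addition is an `assert` on the drive index, which is an assumption of this sketch and not part of the Hux header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IDE_SELECT_DRV (1 << 4)
#define IDE_SELECT_LBA (1 << 6)

/** Same logic as ide_select_entry() in idedisk.h, repeated for the demo. */
static inline uint8_t
ide_select_entry(bool use_lba, uint8_t drive, uint32_t sector_no)
{
    assert(drive <= 1);                 /** Added check, demo only. */

    uint8_t reg = 0xA0;                 /** Bits 7 and 5 are always set. */
    if (use_lba)
        reg |= IDE_SELECT_LBA;          /** Bit 6: LBA addressing. */
    if (drive != 0)
        reg |= IDE_SELECT_DRV;          /** Bit 4: drive index. */
    reg |= (sector_no >> 24) & 0x0F;    /** Bits 0-3: LBA bits 24-27. */
    return reg;
}
```

With this definition, `ide_select_entry(true, 0, 0)` evaluates to 0xE0 (drive 0, LBA mode), `ide_select_entry(true, 1, 0)` to 0xF0 (drive 1, LBA mode), and `ide_select_entry(true, 0, 0x0A123456)` to 0xEA, since the top nibble of that sector number is 0xA.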

Data Transfer Protocol

IDE devices generate interrupts on request completion by default. Since we have multi-tasking enabled in our system (multiple processes time-sharing the CPU), we prefer interrupt-based completion of disk requests over polling-based methods ✭.

Polling-based: after starting a request, let the CPU loop on checking the status port value until the busy flag is cleared or an error occurs
- Generally shorter response time compared to interrupts
- Code logic of polling is much simpler to write
- Wastes CPU cycles that could otherwise be used to run other processes
Interrupt-based: after starting a request, let the caller process block on this request and schedule some other processes to run; on a request completion interrupt, wake up the process blocked on that request
- Suits our multi-tasking system better

We create a new structure that describes a block device request. Since this structure will be tailored with the upper-layer file system in the future, we put it @ src/filesys/block.h:

/** All block requests are of size 1024 bytes. */
#define BLOCK_SIZE 1024


/**
 * Block device request buffer.
 *   - valid && dirty:   waiting to be written to disk
 *   - !valid && !dirty: waiting to be read from disk
 *   - valid && !dirty:  normal buffer with valid data
 *   - !valid && dirty:  cannot happen
 */
struct block_request {
    bool valid;
    bool dirty;
    struct block_request *next;     /** Next in device queue. */
    uint32_t block_no;              /** Block index on disk. */
    uint8_t data[BLOCK_SIZE];
};
typedef struct block_request block_request_t;

This structure basically describes a buffer of BLOCK_SIZE bytes of data associated with some state. All requests to the disk will be sent at the granularity of such BLOCK_SIZE-sized blocks. BLOCK_SIZE could be any reasonable multiple of the disk's IDE_SECTOR_SIZE. We use the 28-bit logical block addressing (LBA) mode, in which every sector on disk is identified by a single 28-bit linear sector number, so the sectors of consecutive blocks are simply laid out one after another in that address space.
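The block-to-sector arithmetic and the way a 28-bit sector number is split across device registers can be sketched as follows. The type `lba28_parts_t` and the two helper functions are hypothetical names used only for illustration; the driver itself performs the same shifts and masks inline.

```c
#include <assert.h>
#include <stdint.h>

#define IDE_SECTOR_SIZE 512
#define BLOCK_SIZE      1024

/** Hypothetical container for the four pieces of an LBA28 address. */
typedef struct {
    uint8_t lo;     /** Bits 0-7,   for the LBA low register. */
    uint8_t mid;    /** Bits 8-15,  for the LBA mid register. */
    uint8_t hi;     /** Bits 16-23, for the LBA high register. */
    uint8_t top;    /** Bits 24-27, for the low nibble of drive/head. */
} lba28_parts_t;

/** First sector of a block: blocks are laid out back to back. */
static inline uint32_t
block_to_sector(uint32_t block_no)
{
    return block_no * (BLOCK_SIZE / IDE_SECTOR_SIZE);
}

/** Split a 28-bit sector number into register-sized pieces. */
static inline lba28_parts_t
lba28_split(uint32_t sector_no)
{
    assert(sector_no < (1u << 28));     /** Must fit in 28 bits. */

    lba28_parts_t parts;
    parts.lo  = sector_no         & 0xFF;
    parts.mid = (sector_no >> 8)  & 0xFF;
    parts.hi  = (sector_no >> 16) & 0xFF;
    parts.top = (sector_no >> 24) & 0x0F;
    return parts;
}
```

For example, block 2 starts at sector 4 when each block spans two sectors. With 28 address bits and 512-byte sectors, this mode can address at most 2^28 × 512 bytes = 128 GiB per drive.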

We maintain a software queue of disk requests in the form of a simple linked list of block_request_t structures. Head points to the currently active request in flight, tail points to the latest request pushed into the queue, and each structure's next field points to the next one in the queue.
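The queue discipline described above can be sketched in isolation as below. This is a simplified stand-in, not the Hux code: `demo_request_t`, `demo_queue_t`, and the two functions are hypothetical, and the real driver keeps the head and tail in two static variables and manipulates them with interrupts disabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/** Trimmed-down stand-in for block_request_t: only what the queue needs. */
typedef struct demo_request {
    struct demo_request *next;
    uint32_t block_no;
} demo_request_t;

typedef struct {
    demo_request_t *head;   /** Active request, NULL if the device is idle. */
    demo_request_t *tail;   /** Most recently queued request. */
} demo_queue_t;

/**
 * Append a request at the tail. Returns 1 if it became the head, meaning
 * the device was idle and should be started on it, otherwise 0.
 */
static inline int
demo_queue_push(demo_queue_t *q, demo_request_t *req)
{
    assert(q != NULL && req != NULL);

    req->next = NULL;
    if (q->tail != NULL)
        q->tail->next = req;
    else
        q->head = req;
    q->tail = req;
    return q->head == req;
}

/** Remove and return the head, i.e., the request that just completed. */
static inline demo_request_t *
demo_queue_pop(demo_queue_t *q)
{
    assert(q != NULL);

    demo_request_t *req = q->head;
    if (req == NULL)
        return NULL;
    q->head = req->next;
    if (q->head == NULL)
        q->tail = NULL;     /** Queue drained, device goes idle. */
    return req;
}
```

Pushing corresponds to what a caller does when submitting a request, and popping corresponds to what the completion interrupt does before starting the device on the next request, if any.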

Initialization steps @ src/device/idedisk.c:

/** Data returned by the IDENTIFY command during initialization. */
static uint16_t ide_identify_data[256];


/** IDE pending requests software queue. */
static block_request_t *ide_queue_head = NULL;
static block_request_t *ide_queue_tail = NULL;


/**
 * Wait for IDE disk on primary bus to become ready. Returns false on errors
 * or device faults, otherwise true.
 */
static bool
_ide_wait_ready(void)
{
    uint8_t status;
    do {
        /** Read from alternate status so it won't affect interrupts. */
        status = inb(IDE_PORT_R_ALT_STATUS);
    } while ((status & (IDE_STATUS_BSY | IDE_STATUS_RDY)) != IDE_STATUS_RDY);

    if ((status & (IDE_STATUS_DF | IDE_STATUS_ERR)) != 0)
        return false;
    return true;
}


/**
 * Initialize a single IDE disk 0 on the default primary bus. Registers the
 * IDE request interrupt ISR handler.
 */
void
idedisk_init(void)
{
    /** Register IDE disk interrupt ISR handler. */
    isr_register(INT_NO_IDEDISK, &idedisk_interrupt_handler);
                 // 46, add macro definition in `src/interrupt/isr.h`

    /** Select disk 0 on primary bus and wait for it to be ready */
    outb(IDE_PORT_RW_SELECT, ide_select_entry(true, 0, 0));
    _ide_wait_ready();
    outb(IDE_PORT_W_CONTROL, 0);    /** Ensure interrupts on. */

    /**
     * Detect that disk 0 on the primary ATA bus is there and is a PATA
     * (IDE) device. Utilizes the IDENTIFY command.
     */
    // Kind of tedious, please refer to code in place...

    ide_queue_head = NULL;
    ide_queue_tail = NULL;
}
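The two checks inside `_ide_wait_ready()` can be read as a classification of a single status byte, sketched below. The result type and both function names are hypothetical. The bounded variant is an assumption of this sketch rather than something Hux does: it takes the status reader as a function pointer and gives up after a fixed number of polls, which avoids spinning forever when no drive responds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IDE_STATUS_ERR (1 << 0)
#define IDE_STATUS_DF  (1 << 5)
#define IDE_STATUS_RDY (1 << 6)
#define IDE_STATUS_BSY (1 << 7)

/** Hypothetical result type, not part of Hux. */
typedef enum {
    IDE_POLL_AGAIN,     /** BSY set or RDY clear: keep polling. */
    IDE_POLL_READY,     /** RDY set, BSY clear, no ERR/DF. */
    IDE_POLL_FAILED,    /** RDY set, BSY clear, but ERR or DF set. */
    IDE_POLL_TIMEOUT    /** Gave up after too many polls. */
} ide_poll_result_t;

/** Apply the same two checks as _ide_wait_ready() to one status byte. */
static inline ide_poll_result_t
ide_classify_status(uint8_t status)
{
    if ((status & (IDE_STATUS_BSY | IDE_STATUS_RDY)) != IDE_STATUS_RDY)
        return IDE_POLL_AGAIN;
    if ((status & (IDE_STATUS_DF | IDE_STATUS_ERR)) != 0)
        return IDE_POLL_FAILED;
    return IDE_POLL_READY;
}

/** Poll through a caller-supplied reader, at most max_polls times. */
static inline ide_poll_result_t
ide_wait_ready_bounded(uint8_t (*read_status)(void), uint32_t max_polls)
{
    assert(read_status != NULL);

    for (uint32_t i = 0; i < max_polls; ++i) {
        ide_poll_result_t result = ide_classify_status(read_status());
        if (result != IDE_POLL_AGAIN)
            return result;
    }
    return IDE_POLL_TIMEOUT;
}
```

Note that a status byte of 0xFF, which a bus with no drive attached commonly reads as, has BSY set and is therefore classified as "keep polling". An unbounded loop would never exit in that situation.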

Function for submitting a block request and waiting for its completion (which will be called by the upper-layer file system later) @ src/device/idedisk.c:

/**
 * Start a request to IDE disk.
 * Must be called with interrupts off.
 */
static void
_ide_start_req(block_request_t *req)
{
    assert(req != NULL);
    // if (req->block_no >= FILESYS_SIZE)
    //     error("idedisk: request block number exceeds file system size");

    uint8_t sectors_per_block = BLOCK_SIZE / IDE_SECTOR_SIZE;
    uint32_t sector_no = req->block_no * sectors_per_block;

    /** Wait for disk to be in ready state. */
    _ide_wait_ready();

    outb(IDE_PORT_RW_SECTORS, sectors_per_block);   /** Number of sectors. */
    outb(IDE_PORT_RW_LBA_LO,  sector_no         & 0xFF);   /** LBA address - low  bits. */
    outb(IDE_PORT_RW_LBA_MID, (sector_no >> 8)  & 0xFF);   /** LBA address - mid  bits. */
    outb(IDE_PORT_RW_LBA_HI,  (sector_no >> 16) & 0xFF);   /** LBA address - high bits. */
    outb(IDE_PORT_RW_SELECT, ide_select_entry(true, 0, sector_no)); /** LBA bits 24-27. */

    /** If dirty, kick off a write with data, otherwise kick off a read. */
    if (req->dirty) {
        outb(IDE_PORT_W_COMMAND, (sectors_per_block == 1) ? IDE_CMD_WRITE
                                                          : IDE_CMD_WRITE_MULTIPLE);
        /** Must be a stream in 32-bit dwords, can't be in 8-bit bytes. */
        outsl(IDE_PORT_RW_DATA, req->data, BLOCK_SIZE / sizeof(uint32_t));
    } else {
        outb(IDE_PORT_W_COMMAND, (sectors_per_block == 1) ? IDE_CMD_READ
                                                          : IDE_CMD_READ_MULTIPLE);
    }
}

/** Poll until an IDE request has been served. */
static void
_ide_poll_req(block_request_t *req)
{
    /** If is a read, get data now. */
    if (!req->dirty) {
        if (_ide_wait_ready()) {
            /** Must be a stream in 32-bit dwords, can't be in 8-bit bytes. */
            insl(IDE_PORT_RW_DATA, req->data, BLOCK_SIZE / sizeof(uint32_t));
            req->valid = true;
        }
    } else {
        if (_ide_wait_ready())
            req->dirty = false;
    }
}


/**
 * Start and wait for a block request to complete. If request is dirty,
 * write to disk, clear dirty, and set valid. Else if request is not
 * valid, read from disk into data and set valid. Returns true on success
 * and false if error appears in IDE port communications.
 */
bool
idedisk_do_req(block_request_t *req)
{
    process_t *proc = running_proc();

    if (req->valid && !req->dirty)
        error("idedisk_do_req: request valid and not dirty, nothing to do");
    if (!req->valid && req->dirty)
        error("idedisk_do_req: caught a dirty request that is not valid");

    cli_push();

    /** Append to IDE pending requests queue. */
    req->next = NULL;
    if (ide_queue_tail != NULL)
        ide_queue_tail->next = req;
    else
        ide_queue_head = req;
    ide_queue_tail = req;

    /** Start the disk device if it was idle. */
    if (ide_queue_head == req)
        _ide_start_req(req);

    /** Wait for this request to have been served. */
    proc->wait_req = req;
          // Add this field to the PCB in `src/process/process.h`
    process_block(ON_IDEDISK);
                  // Add this to the enum in `src/process/process.h`
    proc->wait_req = NULL;

    /**
     * Could be re-scheduled when an IDE interrupt comes saying that this
     * request has been served. If valid is not set at this time, it means
     * error occurred.
     */
    if (!req->valid || req->dirty) {
        warn("idedisk_do_req: error occurred in IDE disk request");
        cli_pop();
        return false;
    }

    cli_pop();
    return true;
}
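The sanity checks at the top of `idedisk_do_req()` and the success check at its end both follow from the four combinations of the valid and dirty flags. A minimal sketch of that state table is given below. The names `req_flags_t`, `req_action_t`, `req_classify`, and `req_completed_ok` are hypothetical and are not part of the Hux code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/** Only the two state flags of block_request_t, for illustration. */
typedef struct {
    bool valid;
    bool dirty;
} req_flags_t;

/** What the driver should do with a request in a given state. */
typedef enum {
    REQ_DO_WRITE,   /** valid && dirty:   write data to disk. */
    REQ_DO_READ,    /** !valid && !dirty: read data from disk. */
    REQ_NOTHING,    /** valid && !dirty:  buffer already up to date. */
    REQ_ILLEGAL     /** !valid && dirty:  cannot happen. */
} req_action_t;

static inline req_action_t
req_classify(const req_flags_t *flags)
{
    assert(flags != NULL);

    if (flags->valid && flags->dirty)
        return REQ_DO_WRITE;
    if (!flags->valid && !flags->dirty)
        return REQ_DO_READ;
    if (flags->valid)
        return REQ_NOTHING;
    return REQ_ILLEGAL;
}

/** A served request, read or write, ends up valid and clean. */
static inline bool
req_completed_ok(const req_flags_t *flags)
{
    assert(flags != NULL);
    return flags->valid && !flags->dirty;
}
```

Both a successful read (which sets valid) and a successful write (which clears dirty) end in the same valid-and-clean state, which is why a single check after waking up is enough to detect errors.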

Request completion will be handled in IDE disk interrupts. The standard IRQ number for ATA primary bus interrupts is 14 (which maps to interrupt number 32 + 14 = 46 in Hux). Interrupt handler @ src/device/idedisk.c:

/** IDE disk interrupt handler registered for IRQ # 14. */
static void
idedisk_interrupt_handler(interrupt_state_t *state)
{
    (void) state;   /** Unused. */

    /** Head of queue is the active request currently on the fly. */
    block_request_t *req = ide_queue_head;
    if (req == NULL)
        return;

    ide_queue_head = ide_queue_head->next;

    /**
     * This "poll" should finish immediately, as the interrupt indicates
     * that the disk must have been ready.
     */
    _ide_poll_req(req);

    /** Wake up the process waiting on this request. */
    for (process_t *proc = ptable; proc < &ptable[MAX_PROCS]; ++proc) {
        if (proc->state == BLOCKED && proc->block_on == ON_IDEDISK
            && proc->wait_req == req) {
            process_unblock(proc);
        }
    }

    /** If more requests in queue, start the disk on the next one. */
    if (ide_queue_head != NULL)
        _ide_start_req(ide_queue_head);
    else
        ide_queue_tail = NULL;
}

Adding an IDE Disk in QEMU

QEMU is a wonderful emulator that supports emulation of a wide variety of native hardware devices, including IDE hard disk drives. See this page of QEMU documentation for detailed command-line invocation options.

To add an IDE disk 0 on primary bus to the emulation of the system, update the rules of invoking QEMU in our Makefile to:

FILESYS_IMG=fs.img

QEMU_OPTS=-vga std -cdrom $(TARGET_ISO) -m 128M \
          -drive if=ide,index=0,media=disk,file=$(FILESYS_IMG),format=raw


#
# Launching QEMU/debugging.
#
.PHONY: qemu
qemu:
    @echo
    @echo $(HUX_MSG) "Launching QEMU..."
    qemu-system-i386 $(QEMU_OPTS)

.PHONY: qemu_debug
qemu_debug:
    @echo
    @echo $(HUX_MSG) "Launching QEMU (debug mode)..."
    qemu-system-i386 $(QEMU_OPTS) -S -s

The emulated disk drive uses a file (fs.img here) on the host system as its storage space. Disk space is supposed to be in the form of a formatted file system image that adheres to the file system layout. Since we have not designed what our file system should look like yet, let's make it a zeroed file for now.

#
# File system image to be loaded as disk content.
#
.PHONY: filesys
filesys:
    @echo
    @echo $(HUX_MSG) "Making the file system image..."
    dd if=/dev/zero of=$(FILESYS_IMG) bs=512 count=1000

Progress So Far

Let's try issuing some requests to the IDE disk device! Issuing block requests is a privileged operation, so add a dummy syscall and call it in a user process:

int32_t
syscall_dummy(void)
{
    block_request_t req_w;
    req_w.valid = true;
    req_w.dirty = true;
    req_w.block_no = 2;
    memset(req_w.data, 'A', BLOCK_SIZE);
    idedisk_do_req(&req_w);
    printf("Written a block of char 'A's @ block index %u\n", req_w.block_no);

    block_request_t req_r;
    req_r.valid = false;
    req_r.dirty = false;
    req_r.block_no = 2;
    idedisk_do_req(&req_r);
    req_r.data[10] = '\0';
    printf("First 10 bytes read: %s\n", req_r.data);

    return 0;
}

After booting up, the terminal should display the two messages printed by this syscall, the second one showing ten 'A' characters read back from the disk.

And if you inspect the content of the fs.img file on the host system (the emulated IDE disk) after shutting down QEMU, you should see 1024 bytes of 'A' (0x41) starting at byte offset 2048, i.e., at block index 2, which shows that data is really being persisted.
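If you prefer to check the image programmatically on the host, a small helper along the following lines could be used. This is a host-side sketch, not part of the kernel; the function name `image_block_is_filled` is hypothetical, and it assumes the same BLOCK_SIZE of 1024 bytes as the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 1024

/**
 * Return true if block `block_no` of the open image consists entirely of
 * byte `expected`. Returns false on a seek/read failure or a mismatch.
 */
static inline bool
image_block_is_filled(FILE *img, uint32_t block_no, uint8_t expected)
{
    assert(img != NULL);

    uint8_t buf[BLOCK_SIZE];

    /** Block N starts at byte offset N * BLOCK_SIZE in a raw image. */
    if (fseek(img, (long) block_no * BLOCK_SIZE, SEEK_SET) != 0)
        return false;
    if (fread(buf, 1, BLOCK_SIZE, img) != BLOCK_SIZE)
        return false;

    for (size_t i = 0; i < BLOCK_SIZE; ++i) {
        if (buf[i] != expected)
            return false;
    }
    return true;
}
```

A caller would open the image with `fopen("fs.img", "rb")` and call `image_block_is_filled(img, 2, 'A')`. The image created by the `dd` rule is 512 × 1000 = 512000 bytes, i.e., 500 blocks, so valid block indices range from 0 to 499.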

Current repo structure:

hux-kernel
├── Makefile
├── scripts
│   ├── gdb_init
│   ├── grub.cfg
│   └── kernel.ld
├── src
│   ├── boot
│   │   ├── boot.s
│   │   ├── elf.h
│   │   └── multiboot.h
│   ├── common
│   │   ├── debug.c
│   │   ├── debug.h
│   │   ├── intstate.c
│   │   ├── intstate.h
│   │   ├── port.c
│   │   ├── port.h
│   │   ├── printf.c
│   │   ├── printf.h
│   │   ├── string.c
│   │   ├── string.h
│   │   ├── types.c
│   │   └── types.h
│   ├── device
│   │   ├── idedisk.c
│   │   ├── idedisk.h
│   │   ├── keyboard.c
│   │   ├── keyboard.h
│   │   ├── sysdev.c
│   │   ├── sysdev.h
│   │   ├── timer.c
│   │   └── timer.h
│   ├── display
│   │   ├── sysdisp.c
│   │   ├── sysdisp.h
│   │   ├── terminal.c
│   │   ├── terminal.h
│   │   └── vga.h
│   ├── filesys
│   │   └── block.h
│   ├── interrupt
│   │   ├── idt-load.s
│   │   ├── idt.c
│   │   ├── idt.h
│   │   ├── isr-stub.s
│   │   ├── isr.c
│   │   ├── isr.h
│   │   ├── syscall.c
│   │   └── syscall.h
│   ├── memory
│   │   ├── gdt-load.s
│   │   ├── gdt.c
│   │   ├── gdt.h
│   │   ├── kheap.c
│   │   ├── kheap.h
│   │   ├── paging.c
│   │   ├── paging.h
│   │   ├── slabs.c
│   │   ├── slabs.h
│   │   ├── sysmem.c
│   │   └── sysmem.h
│   ├── process
│   │   ├── layout.h
│   │   ├── process.c
│   │   ├── process.h
│   │   ├── scheduler.c
│   │   ├── scheduler.h
│   │   ├── switch.s
│   │   ├── sysproc.c
│   │   └── sysproc.h
│   └── kernel.c
├── user
│   ├── lib
│   │   ├── debug.h
│   │   ├── malloc.c
│   │   ├── malloc.h
│   │   ├── printf.c
│   │   ├── printf.h
│   │   ├── string.c
│   │   ├── string.h
│   │   ├── syscall.h
│   │   ├── syscall.s
│   │   ├── syslist.s
│   │   ├── types.c
│   │   └── types.h
│   └── init.c

Guanzhou Jose Hu @ 2021
