
10. Two Level Paging Setup


In this chapter, we will turn theory into practice. To be specific, we will go step by step through enabling two-level paging in our system. A functioning virtual memory system is indispensable for any further development of the kernel.

Main References of This Chapter

Scan through them before moving on:

Memory Layout Explanation

If we go back and revisit our booting code src/boot/boot.s and our linker script scripts/kernel.ld that glues all the code and data sections together, we can clearly figure out the memory layout of our kernel image right after booting.

The kernel image is loaded into raw physical memory and there is no virtual memory system yet, so the addresses are just physical addresses. Let's assume that we want to support up to 128MiB of physical memory. For simplicity, our paging system will only support pages of size 4KiB, and the physical memory is divided into frames of size 4KiB.

Our goal after setting up paging is to have the kernel's virtual address space identity-mapped to the beginning of the physical memory and reserve some fixed amount of memory for the kernel heap. The rest of the physical memory will be free frames for mapping user process pages.

Be sure to understand where the kernel's booting stack is, where the kernel heap is, and how they correspond to our code ✭. We will explain where we put the kernel stack of each process in other chapters.

Spoiler: each process will be allocated a (perhaps fixed-size) kernel stack somewhere in the kernel heap, and that requires a functioning heap memory allocation algorithm.

Memory Allocation Preparations

Before we start making page tables, we must come up with some basic solutions to allow memory allocation on the kernel heap. In particular, we will need two things: a temporary kernel memory allocator _kalloc_temp(), and a free frame bitmap frame_bitmap to track the free/in-use state of each physical frame.

Where the Kernel Heap Starts

A dump of our kernel ELF image gives the following list of sections:

$ readelf -S hux.bin

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        00100000 001000 0033f7 00  AX  0   0 16
  [ 2] .rodata           PROGBITS        00104000 005000 000268 00   A  0   0 32
  [ 3] .rodata.str1.1    PROGBITS        00104268 005268 000137 01 AMS  0   0  1
  [ 4] .rodata.str1.4    PROGBITS        001043a0 0053a0 00056c 01 AMS  0   0  4
  [ 5] .rodata.cst8      PROGBITS        00104910 005910 000008 08  AM  0   0  8
  [ 6] .rodata.cst4      PROGBITS        00104918 005918 00000c 04  AM  0   0  4
  [ 7] .data             PROGBITS        00105000 006000 000e20 00  WA  0   0 32
  [ 8] .bss              NOBITS          00106000 006e20 004cb8 00  WA  0   0 32
  [ 9] .comment          PROGBITS        00000000 006e20 000012 01  MS  0   0  1
  [10] .symtab           SYMTAB          00000000 006e34 0009b0 10     11  51  4
  [11] .strtab           STRTAB          00000000 0077e4 000565 00      0   0  1
  [12] .shstrtab         STRTAB          00000000 007d49 000075 00      0   0  1

As you can see, the .shstrtab section is the last section, and everything above it should be free memory. If you recall how we searched for symbol names in the debugging chapter, you will notice that we searched the ELF section header table for the address and size of .symtab & .strtab. We now record an extra piece of information: the end address of all sections. We round it up to be page-aligned and it becomes the beginning of our kernel heap.

// src/common/debug.c

/** For marking the end of all sections. Kernel heap begins above this. */
uint32_t elf_sections_end;

void
debug_init(multiboot_info_t *mbi)
{
    /** Get the section header table as an array of section headers. */
    elf_section_header_t *sht = (elf_section_header_t *) mbi->elf_sht.addr;
    size_t sht_len = (size_t) mbi->elf_sht.num;

    /**
     * The section header at index 'shndx' in the table is a meta section
     * header. It contains a list of section header names in the table. We
     * should look for section header names ".symtab" & ".strtab".
     */
    const char *sh_names = (const char *) sht[mbi->elf_sht.shndx].addr;

    /**
     * Loop through the table and look for ".symtab" & ".strtab".
     *
     * Also records the highest address seen across all sections, which will
     * later determine the starting point of kernel heap.
     */
    for (size_t i = 0; i < sht_len; ++i) {
        const char *name = sh_names + sht[i].name;

        if (strncmp(name, ".symtab", 7) == 0) {
            elf_symtab = (elf_symbol_t *) sht[i].addr;
            elf_symtab_size = sht[i].size;
        } else if (strncmp(name, ".strtab", 7) == 0) {
            elf_strtab = (const char *) sht[i].addr;
            elf_strtab_size = sht[i].size;
        }

        // Add this.
        if (sht[i].addr + sht[i].size > elf_sections_end)
            elf_sections_end = sht[i].addr + sht[i].size;
    }
}
// src/common/debug.h

/** For marking the begin of kernel heap. */
extern uint32_t elf_sections_end;

Also, a quick recall of where the kernel's booting stack is: it lives in the .bss section (the section for uninitialized static variables) - we reserved 16KiB of space for the booting stack in boot.s.

Temporary Kernel Heap Allocator

Since we do not have a heap memory allocation algorithm at this time, we will use a temporary solution _kalloc_temp() which simply maintains a pointer kheap_curr and grows the pointer at each allocation. Code @ src/memory/paging.c:

/** Kernel heap bottom address - should be above `elf_sections_end`. */
uint32_t kheap_curr;


/**
 * Auxiliary function for allocating (page-aligned) chunks of memory in the
 * kernel heap region that never gets freed.
 * 
 * Should only be used to allocate the kernel's page directory/tables and
 * the frames bitmap and other things before our actual heap allocation
 * algorithm setup.
 */
static uint32_t
_kalloc_temp(size_t size, bool page_align)
{
    /** If `page_align` is set, return an aligned address. */
    if (page_align && !ADDR_PAGE_ALIGNED(kheap_curr))
        kheap_curr = ADDR_PAGE_ROUND_UP(kheap_curr);

    /** If exceeds the 8MiB kernel memory boundary, panic. */
    if (kheap_curr + size > KMEM_MAX)
        error("_kalloc_temp: kernel memory exceeds boundary");

    uint32_t temp = kheap_curr;
    kheap_curr += size;
    return temp;
}
// src/memory/paging.h

/** Extern resulted `kheap_curr` for heap allocator initialization. */
extern uint32_t kheap_curr;

This allows allocating bytes at the bottom of the kernel heap, but it does not allow freeing any of the allocated bytes. Fortunately, everything we allocate in this chapter never needs to be freed, because these are data structures that must exist throughout the kernel's lifetime. We will develop a complete heap memory allocation algorithm in the next chapter, after we have switched to the paging world.
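
As a quick illustration of the page_align flag, consider this hypothetical call sequence (for illustration only, not part of the kernel code):

/** Hypothetical calls illustrating `page_align` (illustration only). */
uint32_t a = _kalloc_temp(100, false);  /** Returns the current `kheap_curr`, bumps it by 100 bytes. */
uint32_t b = _kalloc_temp(4096, true);  /** Rounds `kheap_curr` up to the next 4KiB boundary first,
                                          *  then returns that aligned address and bumps it by 4096. */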

Frame Allocation Bitmap

We also need a data structure for tracking which physical frames are free and which are in use. The best match for this kind of task is a bitmap. We create an array of uint8_t's, where each byte is essentially an array of 8 bits. Every bit tracks the in-use state of a physical frame: 0 means free and 1 means in use. With 128MiB of physical memory and 4KiB frames, there are 32768 frames, so the bitmap takes 32768 / 8 = 4096 bytes - exactly one page.

Add these implementations @ src/memory/paging.c:

/** Bitmap indicating free/used frames. */
static uint8_t *frame_bitmap;

/**
 * Helper functions for managing free physical frames, using a bitmap
 * data structure. Every bit indicates the free/used state of a corresponding
 * physical frame. Frame number one-one maps to bit index.
 */
#define BITMAP_OUTER_IDX(frame_num) ((frame_num) / 8)
#define BITMAP_INNER_IDX(frame_num) ((frame_num) % 8)

/** Set a frame as used. */
static inline void
frame_bitmap_set(uint32_t frame_num)
{
    size_t outer_idx = BITMAP_OUTER_IDX(frame_num);
    size_t inner_idx = BITMAP_INNER_IDX(frame_num);
    frame_bitmap[outer_idx] |= (1 << (7 - inner_idx));
}

/** Clear a frame as free. */
static inline void
frame_bitmap_clear(uint32_t frame_num)
{
    size_t outer_idx = BITMAP_OUTER_IDX(frame_num);
    size_t inner_idx = BITMAP_INNER_IDX(frame_num);
    frame_bitmap[outer_idx] &= ~(1 << (7 - inner_idx));
}

/** Returns true if a frame is in use, otherwise false. */
static inline bool
frame_bitmap_check(uint32_t frame_num)
{
    size_t outer_idx = BITMAP_OUTER_IDX(frame_num);
    size_t inner_idx = BITMAP_INNER_IDX(frame_num);
    return frame_bitmap[outer_idx] & (1 << (7 - inner_idx));
}

/**
 * Allocate a frame and mark it as used. Returns the frame number of
 * the allocated frame, or NUM_FRAMES if no free frame is left (the
 * caller should treat that as an out-of-memory error).
 */
static uint32_t
frame_bitmap_alloc(void)
{
    for (size_t i = 0; i < (NUM_FRAMES / 8); ++i) {
        if (frame_bitmap[i] == 0xFF)
            continue;
        for (size_t j = 0; j < 8; ++j) {
            if ((frame_bitmap[i] & (1 << (7 - j))) == 0) {
                /** Found a free frame. */
                uint32_t frame_num = i * 8 + j;
                frame_bitmap_set(frame_num);
                return frame_num;
            }
        }
    }

    return NUM_FRAMES;
}
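
The allocator returns NUM_FRAMES as an "out of frames" sentinel, so callers should check for it. A minimal sketch of the intended caller pattern (hypothetical code, not part of the kernel yet):

uint32_t frame_num = frame_bitmap_alloc();
if (frame_num >= NUM_FRAMES)
    error("frame_bitmap_alloc: no free physical frames left");
/** ... use the frame, e.g., map a page onto it ... */
frame_bitmap_clear(frame_num);  /** Release it when no longer needed. */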

Page Table Structures

Now, we can move on to defining the page table entry structures (according to what the x86 32-bit MMU expects, see here). We will have a two-level page table ✭.

  • The upper level is a 4KiB page directory, which contains 1024 page directory entries (PDEs), each 32 bits long;
  • Every PDE, if valid, points to a level-2 4KiB page table, which contains 1024 page table entries (PTEs), each 32 bits long;
  • Each PTE records the physical frame allocated for that page. Together, 1024 PDEs × 1024 PTEs × 4KiB pages cover the full 4GiB 32-bit virtual address space.

Given any 32-bit virtual address, it is easy to know how to locate its PTE:

32-bit VADDR := |31  PDE Index  22|21  PTE Index  12|11  Offset In Page  0|

Add these definitions @ src/memory/paging.h:

/** Assume 4KiB pages; no other sizes are supported. */
#define PAGE_SIZE 4096

#define PTES_PER_PAGE 1024
#define PDES_PER_PAGE 1024


/** Number of physical frames available. Assume 128MiB physical memory. */
#define PHYS_MAX 0x08000000     /** 128MiB physical memory. */
#define NUM_FRAMES (PHYS_MAX / PAGE_SIZE)

/** Up to where is kernel memory, == the upper bound of kernel heap. */
#define KMEM_MAX 0x00800000     /** 8MiB reserved for the kernel. */


/**
 * Page table entry format, 32bits per entry. Order in struct
 * definition is from LSB -> MSB.
 * 
 * See https://wiki.osdev.org/Paging for the detailed definition.
 */
struct page_table_entry {
    uint32_t present  :  1;     /** Set -> present in memory. */
    uint32_t writable :  1;     /** Set -> user writable. (read/write bit) */
    uint32_t user     :  1;     /** Set -> user accessible. */
    uint32_t unused0  :  2;     /** Unused 2 caching bits. */
    uint32_t accessed :  1;     /** Set -> accessed since mapped. */
    uint32_t dirty    :  1;     /** Set -> page has been written to. */
    uint32_t unused1  :  5;     /** Unused 5 misc bits. */
    uint32_t frame    : 20;     /** Physical frame number of the page. */
} __attribute__((packed));
typedef struct page_table_entry pte_t;

/**
 * Page directory entry format, 32bits per entry. Order in struct
 * definition is from LSB -> MSB.
 * 
 * See https://wiki.osdev.org/Paging for the detailed definition.
 */
struct page_directory_entry {
    uint32_t present  :  1;     /** Set -> present in memory. */
    uint32_t writable :  1;     /** Set -> user writable. (read/write bit) */
    uint32_t user     :  1;     /** Set -> user accessible. */
    uint32_t unused0  :  2;     /** Unused 2 caching bits. */
    uint32_t accessed :  1;     /** Set -> accessed since mapped. */
    uint32_t unused1  :  1;     /** Unused bit. */
    uint32_t size     :  1;     /** 0 -> using 4KiB page size. */
    uint32_t unused2  :  4;     /** Unused 4 misc bits. */
    uint32_t frame    : 20;     /** Physical frame number of level-2 table. */
} __attribute__((packed));
typedef struct page_directory_entry pde_t;

Notice the packed attribute used on these two structs (along with bit fields) to avoid GCC struct padding and to guarantee that a structure takes exactly 32 bits in this format.
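
If you want an extra guarantee, a compile-time check (a small sketch using C11 _Static_assert, assuming the toolchain supports C11) can verify that each entry occupies exactly 4 bytes:

/** Compile-time sanity checks: each entry must be exactly 32 bits wide. */
_Static_assert(sizeof(pte_t) == 4, "pte_t must be exactly 4 bytes");
_Static_assert(sizeof(pde_t) == 4, "pde_t must be exactly 4 bytes");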

It is also helpful to have some translation helper macros @ src/memory/paging.h:

/** Helper macros on addresses and page alignments. */
#define ADDR_PAGE_OFFSET(addr) ((addr) & 0x00000FFF)
#define ADDR_PAGE_NUMBER(addr) ((addr) >> 12)

#define ADDR_PDE_INDEX(addr) (ADDR_PAGE_NUMBER(addr) / 1024)
#define ADDR_PTE_INDEX(addr) (ADDR_PAGE_NUMBER(addr) % 1024)

#define ADDR_PAGE_ALIGNED(addr) (ADDR_PAGE_OFFSET(addr) == 0)

#define ADDR_PAGE_ROUND_DN(addr) ((addr) & 0xFFFFF000)
#define ADDR_PAGE_ROUND_UP(addr) (ADDR_PAGE_ROUND_DN((addr) + 0x00000FFF))


/** Helper macro on getting the pointed-to address stored in an entry. */
#define ENTRY_FRAME_ADDR(entry) ((uint32_t) (entry).frame << 12)
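
As a quick sanity check of these macros, take a hypothetical virtual address 0x00403012 (chosen purely for illustration):

/** Decomposing the hypothetical virtual address 0x00403012. */
uint32_t vaddr  = 0x00403012;
size_t pde_idx  = ADDR_PDE_INDEX(vaddr);    /** (0x00403012 >> 12) / 1024 == 1    */
size_t pte_idx  = ADDR_PTE_INDEX(vaddr);    /** (0x00403012 >> 12) % 1024 == 3    */
uint32_t offset = ADDR_PAGE_OFFSET(vaddr);  /** 0x00403012 & 0x00000FFF  == 0x012 */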

The action of finding a PTE for a corresponding virtual address is often called walking the page directory/table. It is nice to have a function for doing that @ src/memory/paging.c:

/**
 * Walk a 2-level page table for a virtual address to locate its PTE.
 * If `alloc` is true, then when a level-2 table is needed but not
 * allocated yet, will perform the allocation.
 */
pte_t *
paging_walk_pgdir_at_boot(pde_t *pgdir, uint32_t vaddr, bool alloc)
{
    size_t pde_idx = ADDR_PDE_INDEX(vaddr);
    size_t pte_idx = ADDR_PTE_INDEX(vaddr);

    /** If already has the level-2 table, return the correct PTE. */
    if (pgdir[pde_idx].present != 0) {
        pte_t *pgtab = (pte_t *) ENTRY_FRAME_ADDR(pgdir[pde_idx]);
        return &pgtab[pte_idx];
    }

    /**
     * Else, the level-2 table is not allocated yet. Do the allocation if
     * the alloc argument is set, otherwise return a NULL.
     */
    if (!alloc)
        return NULL;

    pte_t *pgtab = (pte_t *) _kalloc_temp(sizeof(pte_t) * PTES_PER_PAGE, true);
    assert(pgtab != NULL);
    memset(pgtab, 0, sizeof(pte_t) * PTES_PER_PAGE);

    pgdir[pde_idx].present = 1;
    pgdir[pde_idx].writable = 0;
    pgdir[pde_idx].user = 1;    /** Just allow user access on all PDEs. */
    pgdir[pde_idx].frame = ADDR_PAGE_NUMBER((uint32_t) pgtab);

    return &pgtab[pte_idx];
}
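
For example, a hypothetical lookup (not part of the kernel code) that checks whether some virtual address vaddr is currently mapped in a page directory pgdir could look like:

/** Hypothetical lookup: is `vaddr` mapped in page directory `pgdir`? */
pte_t *pte = paging_walk_pgdir_at_boot(pgdir, vaddr, false);
bool mapped = (pte != NULL) && pte->present;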

Switching to the Paging World

With the mechanisms ready, we can now allocate space for the kernel's page directory, identity-map all the kernel-reserved memory pages, and finally, tell the underlying x86 hardware to switch to paging mode.

We group these into an initialization function to be called at booting @ src/memory/paging.c:

/** kernel's identity-mapping page directory. */
pde_t *kernel_pgdir;    /** Allocated at paging init. */


/** Switch the current page directory to the given one. */
inline void
paging_switch_pgdir(pde_t *pgdir)
{
    assert(pgdir != NULL);
    asm volatile ( "movl %0, %%cr3" : : "r" (pgdir) );
}

/** Initialize paging and switch to use paging. */
void
paging_init(void)
{
    /** Kernel heap starts above all ELF sections. */
    kheap_curr = ADDR_PAGE_ROUND_UP((uint32_t) elf_sections_end);

    /**
     * The frame bitmap also needs space, so allocate space for it in
     * our kernel heap. Clear it to zeros.
     */
    frame_bitmap = (uint8_t *) _kalloc_temp(NUM_FRAMES / 8, false);
    memset(frame_bitmap, 0, NUM_FRAMES / 8);

    /**
     * Allocate the one-page space for the kernel's page directory in
     * the kernel heap. All pages of page directory/tables must be
     * page-aligned.
     */
    kernel_pgdir = (pde_t *) _kalloc_temp(sizeof(pde_t) * PDES_PER_PAGE, true);
    memset(kernel_pgdir, 0, sizeof(pde_t) * PDES_PER_PAGE);

    /**
     * Identity-map the kernel's virtual address space to the physical
     * memory. This means we need to map all the allowed kernel physical
     * frames (from 0 -> KMEM_MAX) as its identity virtual address in
     * the kernel page table, and reserve this entire physical memory region.
     *
     * Assumes that `frame_bitmap_alloc()` behaves sequentially.
     */
    uint32_t addr = 0;
    while (addr < KMEM_MAX) {
        uint32_t frame_num = frame_bitmap_alloc();
        assert(frame_num < NUM_FRAMES);
        pte_t *pte = paging_walk_pgdir_at_boot(kernel_pgdir, addr, true);
        assert(pte != NULL);

        /** Update the bits in this PTE. */
        pte->present = 1;
        pte->writable = 0;      /** Has no effect. */
        pte->user = 0;
        pte->frame = frame_num;

        addr += PAGE_SIZE;
    }

    /**
     * Also map the rest of physical memory into the kernel page directory,
     * so that the kernel (e.g., the scheduler later on) can access any
     * physical address directly.
     */
    while (addr < PHYS_MAX) {
        pte_t *pte = paging_walk_pgdir_at_boot(kernel_pgdir, addr, true);
        assert(pte != NULL);

        /** Update the bits in this PTE. */
        pte->present = 1;
        pte->writable = 0;      /** Has no effect. */
        pte->user = 0;
        pte->frame = ADDR_PAGE_NUMBER(addr);

        addr += PAGE_SIZE;
    }

    /**
     * Register the page fault handler. This action must be done before
     * we do the actual switch towards using paging.
     */
    isr_register(INT_NO_PAGE_FAULT, &page_fault_handler);
                 // 14, add macro definition in `src/interrupt/isr.h`

    /** Load the address of kernel page directory into CR3. */
    paging_switch_pgdir(kernel_pgdir);

    /**
     * Enable paging by setting the two proper bits of CR0:
     *   - PG bit (31): enable paging
     *   - PE bit (0): enable protected mode
     *   
     * We are not setting the WP bit, so the read/write bit of any PTE just
     * controls whether the page is user writable - in kernel privilege any
     * page can be written.
     */
    uint32_t cr0;
    asm volatile ( "movl %%cr0, %0" : "=r" (cr0) : );
    cr0 |= 0x80000001;
    asm volatile ( "movl %0, %%cr0" : : "r" (cr0) );
}

Don't forget to export the function definitions @ src/memory/paging.h:

/** Extern the kernel page directory pointer to the scheduler. */
extern pde_t *kernel_pgdir;


void paging_init();

pte_t *paging_walk_pgdir_at_boot(pde_t *pgdir, uint32_t vaddr, bool alloc);

void paging_switch_pgdir(pde_t *pgdir);

Note the order in which we do things ✭:

  1. Allocate and initialize the frame bitmap and the kernel page directory
  2. Identity-map the kernel-reserved 8MiB region (0 - 8MiB of physical memory) and mark those frames as used, then direct-map the rest of physical memory (8 - 128MiB) so the kernel can address it
  3. Register the page fault handler (will come to this soon)
  4. Switch the active page directory to the kernel's page directory by setting the value of control register CR3
  5. Finally, enable paging by setting the PG flag of control register CR0 (we also OR in the PE protected-mode flag, which the bootloader has already set, so this is harmless)

Handling Page Faults

There is one last thing we need to take care of. After switching to paging mode, the hardware MMU issues a page fault exception (interrupt # 14 on x86) whenever an access to an unmapped virtual address happens. The OS must provide a handler for that interrupt and decide what to do with the access: either allocate a frame for a valid access, or kill the offending process for an invalid one.

For now, we write a basic page fault handler @ src/memory/paging.c (and register it in paging_init(), as shown in the last section):

/** Page fault (ISR # 14) handler. */
static void
page_fault_handler(interrupt_state_t *state)
{
    /** The CR2 register holds the faulty address. */
    uint32_t faulty_addr;
    asm ( "movl %%cr2, %0" : "=r" (faulty_addr) : );

    /**
     * Analyze the least significant 3 bits of error code to see what
     * triggered this page fault:
     *   - bit 0: page present -> 1, otherwise 0
     *   - bit 1: is a write operation -> 1, read -> 0
     *   - bit 2: is from user mode -> 1, kernel -> 0
     *
     * See https://wiki.osdev.org/Paging for more.
     */
    bool present = state->err_code & 0x1;
    bool write   = state->err_code & 0x2;
    bool user    = state->err_code & 0x4;

    /** Just prints an information message for now. */
    info("Caught page fault {\n"
         "  faulty addr = %p\n"
         "  present: %d\n"
         "  write:   %d\n"
         "  user:    %d\n"
         "}", faulty_addr, present, write, user);

    panic("page fault not handled!");
}

Upon a page fault, the CR2 register holds the virtual address that triggered the fault, and the err_code field in the interrupt state contains three flags that explain the reason for the fault. The handler should then react according to the combination of these flags - for example, an error code of 0b110 means a user-mode write to a page that is not present. For now, we just print the information out, since we have no user process support yet.

Progress So Far

Let's try it out and see how our kernel behaves in paging mode and how it reacts to a page fault! Add these to the main function @ src/kernel.c:

#include "memory/paging.h"


/** The main function that `boot.s` jumps to. */
void
kernel_main(unsigned long magic, unsigned long addr)
{
    ... // All the previous initializations...

    /** Initialize paging and switch to use paging. */
    _init_message("setting up virtual memory using paging");
    paging_init();
    _init_message_ok();
    info("supporting physical memory size: %3dMiB", NUM_FRAMES * 4 / 1024);
    info("reserving memory for the kernel: %3dMiB", KMEM_MAX / 1024 / 1024);

    /** Executes `sti`, CPU starts taking in interrupts. */
    asm volatile ( "sti" );

    uint32_t faulty_addr = 0xBCDE0123;

    printf("\nKernel reading an unmapped address!\n");
    int dummy = *((int *) faulty_addr);
    printf("%d\n", dummy);

    while (1)   // CPU idles with a `hlt` loop.
        asm volatile ( "hlt" );
}

The _init_message*() functions just print some progress messages. Don't forget to comment out the dummy printf("."); in our current timer interrupt handler.

After booting up, the terminal should show the paging initialization messages, followed by the page fault report (and panic) triggered by our deliberate access to the unmapped address.

Current repo structure:

hux-kernel
├── Makefile
├── scripts
│   ├── gdb_init
│   ├── grub.cfg
│   └── kernel.ld
├── src
│   ├── boot
│   │   ├── boot.s
│   │   ├── elf.h
│   │   └── multiboot.h
│   ├── common
│   │   ├── debug.c
│   │   ├── debug.h
│   │   ├── port.c
│   │   ├── port.h
│   │   ├── printf.c
│   │   ├── printf.h
│   │   ├── string.c
│   │   ├── string.h
│   │   ├── types.c
│   │   └── types.h
│   ├── device
│   │   ├── keyboard.c
│   │   ├── keyboard.h
│   │   ├── timer.c
│   │   └── timer.h
│   ├── display
│   │   ├── terminal.c
│   │   ├── terminal.h
│   │   └── vga.h
│   ├── interrupt
│   │   ├── idt-load.s
│   │   ├── idt.c
│   │   ├── idt.h
│   │   ├── isr-stub.s
│   │   ├── isr.c
│   │   └── isr.h
│   ├── memory
│   │   ├── gdt-load.s
│   │   ├── gdt.c
│   │   ├── gdt.h
│   │   ├── paging.c
│   │   └── paging.h
│   └── kernel.c