QEMU execution and instrumentation

This short tutorial shows how to interact with QEMU execution, either to prototype a software solution that can later be ported to DevteroFlex with hardware instrumentation, or to interact with system elements that are not present in the FPGA. We first go over why we care about full-system emulation and QEMU as a platform, then cover the details of QEMU execution that allow us to change its behaviour.

NOTE: This guide was written for QEMU 6.0. Later versions of QEMU might have moved things around, but the principles should still hold. For the repository that has the basic instrumentation, please refer to PARSA's EPFL KnockoutKraken branch.

Introduction to QEMU

QEMU has multiple modes of execution:

QEMU-KVM serves as a hypervisor for hardware virtualisation, executing the guest as a virtual machine.
QEMU system-mode and user-mode emulation. Both rely on JIT (Just In Time) translation of the guest ISA, performed by the TCG (Tiny Code Generator). User-mode can only execute programs that do not make use of system elements (e.g. devices, OS), while system-mode emulates the full system (boot process, hardware, OS, devices, network, etc.), so programs run exactly as they would on the actual system.

Why use emulation and not virtualisation?

Virtualisation is around an order of magnitude faster than emulation because, instead of interpreting the guest ISA, you execute directly on the actual hardware. This has two downsides:

The guest ISA and host ISA must be identical
Virtualisation enables no extra control or tracing of execution beyond what hardware support allows

The first can be overcome by using hardware that matches the target ISA and supports virtualisation. The second drawback is hard to overcome: you cannot instrument existing hardware when using virtualisation, so you are restricted to pre-existing hardware support. While some hardware features allow a degree of instrumentation/tracing (e.g. Intel PT - Processor Trace), they are very limited.

On the other hand, emulating the guest ISA allows you to instrument instruction execution, add new system state, support new instructions, change instruction behaviour and much more. Emulation allows this by abstracting all state in software: a structure stores the ISA state (registers, PC, etc.), instruction execution is emulated with host functions, and so on. Emulation can thus help you prototype in software before implementing hardware support in DevteroFlex, which exposes more or less the same set of events as QEMU emulation. The obvious drawback is that you lose more than an order of magnitude of execution speed.
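To make the idea of "state abstracted in software" concrete, here is a minimal sketch. It is not QEMU code: `ToyCPUState` and `toy_exec_add` are hypothetical names, and QEMU's real `CPUARMState` is far larger. It only shows how a guest instruction becomes plain host arithmetic on a struct, and how extra state (here an instruction counter) can be added freely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified guest state (not QEMU's CPUARMState). */
typedef struct {
    uint64_t xregs[32];   /* general-purpose registers */
    uint64_t pc;          /* program counter */
    uint64_t insn_count;  /* extra state that real hardware does not expose */
} ToyCPUState;

/* Emulate "ADD Xd, Xn, Xm" with plain host arithmetic.
 * Simplification: register 31 is treated as an ordinary register. */
static void toy_exec_add(ToyCPUState *cpu, unsigned rd, unsigned rn, unsigned rm)
{
    assert(rd < 32 && rn < 32 && rm < 32);
    cpu->xregs[rd] = cpu->xregs[rn] + cpu->xregs[rm];
    cpu->pc += 4;        /* A64 instructions are 4 bytes wide */
    cpu->insn_count++;   /* instrumentation: count executed instructions */
}
```

Because the counter lives in the same struct as the architectural state, any other emulated instruction could read or update it, which is the kind of freedom virtualisation does not give you.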

In DevteroFlex's context, you might need to change the QEMU emulation process to add functionality to existing execution:

The DevteroFlex memory hierarchy acts as a cache of the host memory, with pages brought in on demand. A memory instruction might not be supported in the FPGA and must then be executed in the host. This implies that before actually executing a memory operation in QEMU, you must first synchronise the page if it was previously pushed to the FPGA, so we instrumented memory operations to call back a synchronisation routine.
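The synchronisation idea can be sketched as follows. This is a toy model under stated assumptions, not the DevteroFlex implementation: `ToyMemory`, `toy_sync_page` and `toy_load_u8` are hypothetical names, and the "FPGA" is just a second array in host memory. The point is only that every host-side access first runs a callback that pulls the page back if the FPGA holds the latest copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_NUM_PAGES 4u

/* Hypothetical model: one host copy and one "FPGA" copy of guest memory. */
typedef struct {
    uint8_t  host[TOY_NUM_PAGES][TOY_PAGE_SIZE];
    uint8_t  fpga[TOY_NUM_PAGES][TOY_PAGE_SIZE];
    bool     in_fpga[TOY_NUM_PAGES];  /* true if the FPGA holds the latest copy */
    unsigned sync_count;              /* number of pages pulled back so far */
} ToyMemory;

/* Callback run before a host-side access: pull the page back if needed. */
static void toy_sync_page(ToyMemory *m, uint64_t addr)
{
    uint64_t page = addr / TOY_PAGE_SIZE;
    assert(page < TOY_NUM_PAGES);
    if (m->in_fpga[page]) {
        memcpy(m->host[page], m->fpga[page], TOY_PAGE_SIZE);
        m->in_fpga[page] = false;
        m->sync_count++;
    }
}

/* Host-side byte load, instrumented with the synchronisation callback. */
static uint8_t toy_load_u8(ToyMemory *m, uint64_t addr)
{
    toy_sync_page(m, addr);
    return m->host[addr / TOY_PAGE_SIZE][addr % TOY_PAGE_SIZE];
}
```

A second load from the same page does not trigger another copy, since the page is marked as being back in the host after the first synchronisation.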

Getting started: Instrumenting QEMU

An example of the modifications described in this guide can be found in the following commit.

QEMU Emulated Execution

In broad strokes, the QEMU emulation function call stack looks as follows:

main loop (accel/tcg/tcg-accel-ops-rr.c:rr_cpu_thread_fn)
execution loop (accel/tcg/cpu-exec.c:cpu_exec)
instruction decoder (Tiny Code Generator) to a Translation Block (target/arm/translate-a64.c:aarch64_tr_translate_insn)
executing the Translation Blocks (accel/tcg/cpu-exec.c:cpu_tb_exec)

In parentheses is the most important function where each action happens. Some of these are architecture specific (target/arm), and the main loop function changes depending on the execution mode (multi-threaded TCG vs single-threaded TCG). The main loop mostly handles tasks related to QEMU itself, so we won't discuss it here. The execution loop is a tight loop that translates then executes, and from time to time handles exceptions and interrupts.
The instruction decoder is what we care about most: instructions are decoded into a sequence of TCG operations, and here we can add callbacks to perform specific actions. The final step executes the Translation Blocks generated in the previous step. Everything that happens during execution was already prepared during decode, so we won't cover this step.

QEMU Tiny Code Generation (TCG)

Tiny Code Generation can be seen as an instruction decoder tree, where a Translation Block is built by inserting small TCG ops as the decoder detects which kind of instruction it is handling and what actions must be performed. Later on, this Translation Block is cached and executed.
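The split between "build the block at translation time" and "run it later" can be illustrated with a toy model. All names below (`ToyTB`, `toy_emit`, `toy_tb_exec`, `toy_count_helper`) are hypothetical, and the sketch interprets its op list, whereas the real TCG compiles ops to host machine code. What it does share with TCG is that a callback is just one more op appended during translation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_OP_MOVI, TOY_OP_ADDI, TOY_OP_CALL } ToyOpKind;
typedef void (*ToyHelper)(void *opaque);

typedef struct {
    ToyOpKind kind;
    int64_t   imm;     /* immediate for MOVI/ADDI, unused for CALL */
    ToyHelper helper;  /* callback for CALL, NULL otherwise */
} ToyOp;

#define TOY_TB_MAX_OPS 16

typedef struct {
    ToyOp  ops[TOY_TB_MAX_OPS];
    size_t n_ops;
} ToyTB;

/* Translation time: append one op to the block being built. */
static void toy_emit(ToyTB *tb, ToyOpKind kind, int64_t imm, ToyHelper helper)
{
    assert(tb->n_ops < TOY_TB_MAX_OPS);
    tb->ops[tb->n_ops++] = (ToyOp){ .kind = kind, .imm = imm, .helper = helper };
}

/* Execution time: run the recorded ops on a single accumulator. */
static int64_t toy_tb_exec(const ToyTB *tb, void *opaque)
{
    int64_t acc = 0;
    for (size_t i = 0; i < tb->n_ops; i++) {
        const ToyOp *op = &tb->ops[i];
        switch (op->kind) {
        case TOY_OP_MOVI: acc = op->imm;      break;
        case TOY_OP_ADDI: acc += op->imm;     break;
        case TOY_OP_CALL: op->helper(opaque); break;
        }
    }
    return acc;
}

/* Example callback: increments a counter passed through `opaque`. */
static void toy_count_helper(void *opaque)
{
    (*(unsigned *)opaque)++;
}
```

Once built, the same block can be executed many times without being decoded again, and the callback fires on every execution.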

Instrumenting QEMU in this manner can slow QEMU down by more than 2x, even for the most basic operation (e.g. incrementing a counter).

We will follow the example of this process for ARM in target/arm/translate-a64.c.

Decoding of an instruction starts in disas_a64_insn, which fetches the 32-bit word at the current virtual address:

static void disas_a64_insn(CPUARMState *env, DisasContext *s)
{
    uint32_t insn;
    s->pc_curr = s->base.pc_next;
    insn = arm_ldl_code(env, s->base.pc_next, s->sctlr_b);
    ...
}

The instruction is then run through a decode tree to determine which type of instruction it is:

static void disas_a64_insn(CPUARMState *env, DisasContext *s)
{
    ...
    switch (extract32(insn, 25, 4)) {
    case 0x0: case 0x1: case 0x3: /* UNALLOCATED */
        unallocated_encoding(s);
        break;
    ...
    case 0xf:      /* Data processing - SIMD and floating point */
        disas_data_proc_simd_fp(s, insn);
        break;
    default:
        assert(FALSE); /* all 15 cases should be handled above */
        break;
    }
}
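To see what the `extract32(insn, 25, 4)` switch does outside of QEMU, here is a standalone sketch. `toy_extract32` follows the same contract as QEMU's `extract32` but is reimplemented here, and `ToyInsnClass` and `toy_classify_a64` are hypothetical names. The grouping of values follows the top-level A64 encoding groups selected by bits [28:25]; treat it as an illustration rather than a copy of the upstream function.

```c
#include <assert.h>
#include <stdint.h>

/* Return `length` bits of `value`, starting at bit `start`. */
static uint32_t toy_extract32(uint32_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 32 - start);
    return (value >> start) & (~0U >> (32 - length));
}

typedef enum {
    TOY_OTHER,              /* unallocated, or handled elsewhere (e.g. SVE) */
    TOY_DATA_PROC_IMM,
    TOY_BRANCH_EXC_SYS,
    TOY_LOAD_STORE,
    TOY_DATA_PROC_REG,
    TOY_DATA_PROC_SIMD_FP
} ToyInsnClass;

/* Classify an A64 instruction word by bits [28:25]. */
static ToyInsnClass toy_classify_a64(uint32_t insn)
{
    switch (toy_extract32(insn, 25, 4)) {
    case 0x8: case 0x9:                     return TOY_DATA_PROC_IMM;
    case 0xa: case 0xb:                     return TOY_BRANCH_EXC_SYS;
    case 0x4: case 0x6: case 0xc: case 0xe: return TOY_LOAD_STORE;
    case 0x5: case 0xd:                     return TOY_DATA_PROC_REG;
    case 0x7: case 0xf:                     return TOY_DATA_PROC_SIMD_FP;
    default:                                return TOY_OTHER;
    }
}
```

For instance, `0x14000000` (an unconditional `B`) falls in the branch group, `0x8b020020` (`ADD X0, X1, X2`) in the register data-processing group, and `0xf9400020` (`LDR X0, [X1]`) in the load/store group.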

Once we know the instruction type, it is decomposed into a set of small TCG operations:

/* Unconditional branch (immediate)
 *   31  30       26 25                                  0
 * +----+-----------+-------------------------------------+
 * | op | 0 0 1 0 1 |                 imm26               |
 * +----+-----------+-------------------------------------+
 */
static void disas_uncond_b_imm(DisasContext *s, uint32_t insn) {
    uint64_t addr = s->pc_curr + sextract32(insn, 0, 26) * 4;
    if (insn & (1U << 31)) {
        /* BL Branch with link */
        tcg_gen_movi_i64(cpu_reg(s, 30), s->base.pc_next);
    }
    /* B Branch / BL Branch with link */
    reset_btype(s);
    gen_goto_tb(s, 0, addr);
}

This branching example is very basic. If the link bit (bit 31) is set, we insert tcg_gen_movi_i64 to save the return address (s->base.pc_next) in register X30, and finally we generate a jump to the destination address with gen_goto_tb. Looking at the details of gen_goto_tb, you will see that the PC is updated by inserting the operation gen_a64_set_pc_im.
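The address arithmetic in this decoder can be checked in isolation. The sketch below is a standalone model, not QEMU code: `toy_sextract32` has the same contract as QEMU's `sextract32` but is reimplemented, and `ToyBranch` and `toy_decode_b_imm` are hypothetical names. It assumes that the address of the next instruction is `pc_curr + 4`, which holds for fixed-width A64 instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sign-extending bit-field extract: `length` bits starting at bit `start`. */
static int32_t toy_sextract32(uint32_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 32 - start);
    uint32_t field = (value >> start) & (~0U >> (32 - length));
    uint32_t sign  = 1U << (length - 1);
    return (int32_t)((field ^ sign) - sign);  /* flip and subtract the sign bit */
}

typedef struct {
    uint64_t target;      /* destination address of the branch */
    bool     links;       /* true for BL (bit 31 set) */
    uint64_t link_value;  /* return address that BL would write to X30 */
} ToyBranch;

/* Decode B/BL (immediate): target = pc_curr + sign_extend(imm26) * 4. */
static ToyBranch toy_decode_b_imm(uint64_t pc_curr, uint32_t insn)
{
    ToyBranch b;
    int64_t offset = (int64_t)toy_sextract32(insn, 0, 26) * 4;
    b.target     = pc_curr + (uint64_t)offset;
    b.links      = (insn & (1U << 31)) != 0;
    b.link_value = pc_curr + 4;
    return b;
}
```

With `pc_curr = 0x1000`, the word `0x14000002` branches forward to `0x1008`, while `0x17ffffff` (imm26 = -1) branches back to `0xffc`.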

Instrumenting QEMU TCG

We now want to insert function callbacks at specific places in the translation process.

First, define the function to be called back when inserted into the TCG. As helpers are target specific, they must be located in target/arm. Functions that can be called from TCG execution need to have the HELPER macro wrapped around their name. We've prepared a file where you can add these callback functions, target/arm/devteroflex-helper.c:

void HELPER(devteroflex_example_instrumentation)(CPUARMState *env, uint64_t arg1, uint64_t arg2) { ... }

This function also needs to be declared in a header, and as it is a TCG 'helper' this must be done in a specific spot. Normally this would be done in target/arm/helper-a64.h, but we prepared a header file specifically for DevteroFlex TCG callbacks in target/arm/qflex-helpers-a64.h:

#if defined(TCG_GEN) && defined(CONFIG_DEVTEROFLEX)
DEF_HELPER_3(devteroflex_example_instrumentation, void , env, i64, i64)
#elif !defined(CONFIG_DEVTEROFLEX)
void HELPER(devteroflex_example_instrumentation)(CPUARMState *env, uint64_t arg1, uint64_t arg2);
#endif
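If the `HELPER` wrapper looks opaque, the following standalone sketch shows the naming mechanism. In QEMU, `HELPER(name)` pastes the prefix `helper_` onto the name; here the macro is redefined locally so the snippet compiles on its own. `TOY_GLUE`, `ToyEnv` and `toy_example` are hypothetical names standing in for QEMU's own macros, `CPUARMState` and the DevteroFlex helper.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for QEMU's token-pasting macros. */
#define TOY_GLUE(a, b) a##b
#define HELPER(name) TOY_GLUE(helper_, name)

/* Hypothetical stand-in for CPUARMState. */
typedef struct {
    uint64_t last_arg1;
    uint64_t last_arg2;
    unsigned calls;
} ToyEnv;

/* Declaration, as a header would provide it: expands to helper_toy_example. */
void HELPER(toy_example)(ToyEnv *env, uint64_t arg1, uint64_t arg2);

/* Definition: records its arguments so the call can be observed. */
void HELPER(toy_example)(ToyEnv *env, uint64_t arg1, uint64_t arg2)
{
    assert(env != 0);
    env->last_arg1 = arg1;
    env->last_arg2 = arg2;
    env->calls++;
}
```

After preprocessing, the function is simply called `helper_toy_example`. This is why a helper declared through the `DEF_HELPER_*` machinery and a function defined with `HELPER(...)` refer to the same symbol.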

Now that we have a helper, we will insert it in two places in the translation process. The first is right after fetching the instruction (arm_ldl_code):

static void disas_a64_insn(CPUARMState *env, DisasContext *s) {
    uint32_t insn;
    s->pc_curr = s->base.pc_next;
    insn = arm_ldl_code(env, s->base.pc_next, s->sctlr_b);
    GEN_QFLEX_HELPER(devteroflexGen.example, GEN_HELPER(devteroflex_example_instrumentation)( 
                     cpu_env, tcg_const_i64(TAG_INSTRUCTION_DECODED), tcg_const_i64(s->base.pc_next)));
    ...
}

The second is on exception returns (the ERET switch case):

static void disas_uncond_b_reg(DisasContext *s, uint32_t insn) {
   ...
   opc = extract32(insn, 21, 4);
   switch(opc) {
   ...
   case 4: /* ERET */
       ...
       GEN_QFLEX_HELPER(devteroflexGen.example, GEN_HELPER(devteroflex_example_instrumentation)( 
                        cpu_env, tcg_const_i64(TAG_EXCEPTION_RETURN), dst));
       gen_helper_exception_return(cpu_env, dst);
       ...
   }
   ...
}
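Both call sites pass a tag as the first argument and a value as the second, so a single helper can tell the events apart. A possible body for such a helper is sketched below. It is an assumption about how one might use the tags, not the DevteroFlex implementation: the `TOY_TAG_*` values, `ToyStats` and `toy_example_instrumentation` are hypothetical, and the real helper receives a `CPUARMState *` rather than a statistics struct.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tag values; the real TAG_* constants are defined elsewhere. */
enum {
    TOY_TAG_INSTRUCTION_DECODED = 0,
    TOY_TAG_EXCEPTION_RETURN    = 1,
    TOY_TAG_COUNT
};

/* Per-tag statistics gathered by the callback. */
typedef struct {
    uint64_t count[TOY_TAG_COUNT];       /* how often each event fired */
    uint64_t last_value[TOY_TAG_COUNT];  /* last PC or destination seen */
} ToyStats;

/* Runs at execution time, once per executed instrumented site. */
static void toy_example_instrumentation(ToyStats *stats, uint64_t tag, uint64_t value)
{
    assert(tag < TOY_TAG_COUNT);
    stats->count[tag]++;
    stats->last_value[tag] = value;
}
```

Note that the helper runs every time the translated code executes, not when the instruction is decoded, even though the call was inserted during decode.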

Some tricks that are related to QEMU execution

QEMU caches TBs. If your helper is only inserted when a flag is set, a TB that was translated and cached before the flag was set will not execute the helper, so the TB cache must be flushed first.
QEMU serves interrupts when exiting the execution loop. Flushing makes you exit the execution loop, so pending interrupts are served afterwards.
QEMU can sometimes exit in the middle of an instruction's execution using a long jump (back to sigsetjmp(cpu->jmp_env, 0)). If you don't handle this correctly, then depending on where the helper was inserted, it may be executed twice for the same instruction. This is a rare event.
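The first point, about cached TBs, is easy to reproduce in a toy model. The sketch below is not QEMU code, and `ToyEmu`, `toy_exec_at` and `toy_tb_flush` are hypothetical names. It assumes a tiny direct-mapped cache and a translator that decides once, at translation time, whether the helper is part of the block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_CACHE_SIZE 8u

typedef struct {
    bool     valid;
    uint64_t pc;
    bool     has_helper;  /* decided once, at translation time */
} ToyCachedTB;

typedef struct {
    ToyCachedTB cache[TOY_CACHE_SIZE];
    bool        instrument;    /* flag checked by the translator */
    unsigned    helper_calls;  /* how often the helper actually ran */
    unsigned    translations;  /* how often a block was (re)translated */
} ToyEmu;

/* Return the cached block for `pc`, translating it on a miss. */
static ToyCachedTB *toy_tb_lookup_or_translate(ToyEmu *e, uint64_t pc)
{
    assert((pc & 3) == 0);  /* A64 instructions are 4-byte aligned */
    ToyCachedTB *tb = &e->cache[(pc / 4) % TOY_CACHE_SIZE];
    if (!tb->valid || tb->pc != pc) {
        tb->valid = true;
        tb->pc = pc;
        tb->has_helper = e->instrument;  /* the flag is sampled only here */
        e->translations++;
    }
    return tb;
}

/* Execute the block at `pc`; the helper runs only if it was translated in. */
static void toy_exec_at(ToyEmu *e, uint64_t pc)
{
    ToyCachedTB *tb = toy_tb_lookup_or_translate(e, pc);
    if (tb->has_helper) {
        e->helper_calls++;
    }
}

/* Drop every cached block so the next execution retranslates. */
static void toy_tb_flush(ToyEmu *e)
{
    for (unsigned i = 0; i < TOY_CACHE_SIZE; i++) {
        e->cache[i].valid = false;
    }
}
```

Executing a block, then setting `instrument`, then executing it again leaves `helper_calls` at zero. Only after `toy_tb_flush` does the retranslated block pick up the helper. In QEMU itself the equivalent flush is done with `tb_flush()`.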
