diff --git a/docs/cancellation.md b/docs/cancellation.md
new file mode 100644
index 000000000..459e3a35d
--- /dev/null
+++ b/docs/cancellation.md
@@ -0,0 +1,291 @@
+# Cancellation in Hyperlight
+
+This document describes the cancellation mechanism and memory ordering guarantees for Hyperlight.
+
+## Overview (Linux)
+
+Hyperlight provides a mechanism to forcefully interrupt guest execution through the `InterruptHandle::kill()` method. This requires coordination between multiple threads, using atomic operations and POSIX signals, to ensure safe and reliable cancellation.
+
+## Key Components
+
+### LinuxInterruptHandle State
+
+The `LinuxInterruptHandle` uses a packed `AtomicU8` to track execution state:
+
+- **state (AtomicU8)**: Packs three bits:
+  - **Bit 2 (DEBUG_INTERRUPT_BIT)**: Set when a debugger interrupt has been requested (gdb feature only)
+  - **Bit 1 (RUNNING_BIT)**: Set when the vCPU is actively running in guest mode
+  - **Bit 0 (CANCEL_BIT)**: Set when cancellation has been requested via `kill()`
+- **tid (AtomicU64)**: Thread ID where the vCPU is running
+- **dropped (AtomicBool)**: Set when the corresponding VM has been dropped
+
+The packed state allows RUNNING_BIT, CANCEL_BIT, and DEBUG_INTERRUPT_BIT to be read atomically in a single load via `get_running_cancel_debug()`. Within a single `VirtualCPU::run()` call, the CANCEL_BIT remains set across vCPU exits and re-entries (such as when calling host functions), ensuring that cancellation persists until the guest call completes. However, `clear_cancel()` resets the CANCEL_BIT at the beginning of each new guest function call (specifically in `MultiUseSandbox::call`, before `VirtualCPU::run()` is called), preventing stale cancellation requests from affecting subsequent guest function calls.
+
+### Signal Mechanism
+
+On Linux, Hyperlight uses `SIGRTMIN + offset` (configurable; the default offset is 0) to interrupt the vCPU thread. The signal handler is intentionally a no-op: the signal's only purpose is to cause a VM exit via `EINTR` from the `ioctl` call that runs the vCPU.
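+
+For illustration, a handler with the right properties could be installed as below. This is a minimal sketch assuming the `libc` crate; the helper name is made up here, and Hyperlight's actual registration code may differ:
+
+```rust
+use std::io;
+
+// No-op handler: mere delivery of the signal is enough to make a blocking
+// ioctl (e.g. KVM_RUN) fail with EINTR, forcing a VM exit.
+extern "C" fn noop_handler(_sig: libc::c_int) {}
+
+// Hypothetical helper, for illustration only.
+fn install_interrupt_signal(offset: u8) -> io::Result<()> {
+    unsafe {
+        let mut sa: libc::sigaction = std::mem::zeroed();
+        sa.sa_sigaction = noop_handler as extern "C" fn(libc::c_int) as usize;
+        sa.sa_flags = 0; // deliberately no SA_RESTART: the ioctl must not be restarted
+        libc::sigemptyset(&mut sa.sa_mask);
+        let signum = libc::SIGRTMIN() + offset as libc::c_int;
+        if libc::sigaction(signum, &sa, std::ptr::null_mut()) != 0 {
+            return Err(io::Error::last_os_error());
+        }
+    }
+    Ok(())
+}
+```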
+
+## Run Loop Flow
+
+The main execution loop in `VirtualCPU::run()` coordinates vCPU execution with potential interrupts.
+
+```mermaid
+sequenceDiagram
+    participant Caller as Caller (call())
+    participant vCPU as vCPU (run())
+    participant IH as InterruptHandle
+
+    Note over Caller: === TIMING POINT 1 ===
+    Caller->>IH: clear_cancel()
+    Note right of Caller: Start of cancellable window
+
+    Caller->>vCPU: run()
+    activate vCPU
+
+    loop Run Loop
+        Note over vCPU: === TIMING POINT 2 ===
+        vCPU->>IH: set_tid()
+        vCPU->>IH: set_running()
+        Note right of vCPU: Enable signal delivery
+
+        vCPU->>IH: is_cancelled()
+
+        alt is_cancelled() == true
+            vCPU-->>Caller: return Cancelled()
+        else is_cancelled() == false
+            Note over vCPU: === TIMING POINT 3 ===
+            vCPU->>vCPU: run_vcpu() (Enter Guest)
+            activate vCPU
+
+            alt Guest completes normally
+                vCPU-->>vCPU: VmExit::Halt()
+            else Guest performs I/O
+                vCPU-->>vCPU: VmExit::IoOut()/MmioRead()
+            else Signal received
+                vCPU-->>vCPU: VmExit::Cancelled()
+            end
+            deactivate vCPU
+        end
+
+        Note over vCPU: === TIMING POINT 4 ===
+        vCPU->>IH: clear_running()
+        Note right of vCPU: Disable signal delivery
+
+        Note over vCPU: === TIMING POINT 5 ===
+        vCPU->>IH: is_cancelled()
+        IH-->>vCPU: cancel_requested (bool)
+        Note right of vCPU: Check if we should exit
+
+        Note over vCPU: === TIMING POINT 6 ===
+
+        alt Exit reason is Halt
+            vCPU-->>Caller: return Ok(())
+        else Exit reason is Cancelled AND cancel_requested==true
+            vCPU-->>Caller: return Err(ExecutionCanceledByHost)
+        else Exit reason is Cancelled AND cancel_requested==false
+            Note right of vCPU: Stale signal, retry
+            vCPU->>vCPU: continue (retry iteration)
+        else Exit reason is I/O or host call
+            vCPU->>vCPU: Handle and continue loop
+        end
+    end
+    deactivate vCPU
+```
+
+### Detailed Run Loop Steps
+
+1. **Timing Point 1** - Start of Guest Call (in `call()`):
+   - `clear_cancel()` resets the cancellation state *before* `run()` is called.
+   - Any `kill()` completed before this point is ignored.
+
+2. **Timing Point 2** - Start of Loop Iteration:
+   - `set_running()` enables signal delivery.
+   - `is_cancelled()` is checked immediately to handle pre-run cancellation.
+
+3. **Timing Point 3** - Guest Entry:
+   - Enters guest execution.
+   - If `kill()` happens now, signals will interrupt the guest.
+
+4. **Timing Point 4** - Guest Exit:
+   - `clear_running()` disables signal delivery.
+   - Signals sent after this point are ignored.
+
+5. **Timing Point 5** - Capture State:
+   - `is_cancelled()` captures the cancellation request state.
+   - This determines whether a `Cancelled` exit was genuine or stale.
+
+6. **Timing Point 6** - Handle Exit:
+   - Processes the exit reason based on the captured `cancel_requested` state.
+   - If the exit is `Cancelled` but `!cancel_requested`, the signal was stale and the loop retries.
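+
+Stripped of tracing, gdb handling, and I/O dispatch, the loop reduces to roughly the following (a condensed paraphrase of the `VirtualCPU::run()` body shown in the `mod.rs` hunk later in this diff, not the verbatim source):
+
+```rust
+loop {
+    interrupt_handle.set_tid();      // Linux only: publish the vCPU thread id
+    interrupt_handle.set_running();  // Timing Point 2: signals may now be sent
+
+    let exit_reason = if interrupt_handle.is_cancelled() {
+        Ok(HyperlightExit::Cancelled()) // kill() won the race: skip the VM entry
+    } else {
+        hv.run()                        // Timing Point 3: enter the guest
+    };
+
+    interrupt_handle.clear_running();   // Timing Point 4: signal delivery stops
+    let cancel_requested = interrupt_handle.is_cancelled(); // Timing Point 5
+
+    match exit_reason {                 // Timing Point 6
+        Ok(HyperlightExit::Halt()) => return Ok(()),
+        Ok(HyperlightExit::Cancelled()) if !cancel_requested => {
+            continue; // stale signal from an earlier call: retry this iteration
+        }
+        Ok(HyperlightExit::Cancelled()) => {
+            return Err(HyperlightError::ExecutionCanceledByHost());
+        }
+        _ => { /* I/O exits, host calls, retries, and errors elided */ }
+    }
+}
+```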
+
+## Kill Operation Flow
+
+The `kill()` operation involves setting the CANCEL_BIT and sending signals to interrupt the vCPU:
+
+```mermaid
+sequenceDiagram
+    participant Caller as Caller Thread
+    participant IH as InterruptHandle
+    participant Signal as Signal Delivery
+    participant vCPU as vCPU Thread
+
+    Caller->>IH: kill()
+    activate IH
+
+    IH->>IH: fetch_or(CANCEL_BIT, Release)
+    Note right of IH: Atomically set cancel=true<br>with Release ordering
+
+    IH->>IH: send_signal()
+    activate IH
+
+    loop Retry Loop
+        IH->>IH: get_running_cancel_debug()
+        Note right of IH: Load with Acquire ordering
+
+        alt Not running OR not cancelled
+            IH-->>IH: break (sent_signal=false/true)
+        else Running AND cancelled
+            IH->>IH: tid.load(Acquire)
+            IH->>Signal: pthread_kill(tid, SIGRTMIN+offset)
+            activate Signal
+            Note right of Signal: Send signal to vCPU thread
+            Signal->>vCPU: SIGRTMIN+offset delivered
+            Note right of vCPU: Signal handler is no-op<br>Purpose is to cause EINTR
+            deactivate Signal
+
+            alt Signal arrives during ioctl
+                vCPU->>vCPU: ioctl returns EINTR
+                vCPU->>vCPU: return VmExit::Cancelled()
+            else Signal arrives between ioctls
+                Note right of vCPU: Signal is harmless
+            end
+
+            IH->>IH: sleep(retry_delay)
+            Note right of IH: Default 500μs between retries
+        end
+    end
+
+    deactivate IH
+    IH-->>Caller: sent_signal
+    deactivate IH
+```
+
+### Kill Operation Steps
+
+1. **Set Cancel Flag**: Atomically set the CANCEL_BIT using `fetch_or(CANCEL_BIT)` with `Release` ordering
+   - Ensures all writes before `kill()` are visible when the vCPU thread checks `is_cancelled()` with `Acquire`
+
+2. **Send Signals**: Enter the retry loop via `send_signal()`
+   - Atomically load the running, cancel, and debug flags via `get_running_cancel_debug()` with `Acquire` ordering
+   - Continue if `running=true AND cancel=true` (or `running=true AND debug=true` with gdb)
+   - Exit the loop immediately if `running=false OR (cancel=false AND debug=false)`
+
+3. **Signal Delivery**: Send `SIGRTMIN+offset` via `pthread_kill`
+   - The signal interrupts the `ioctl` that runs the vCPU, causing `EINTR`
+   - The signal handler is intentionally a no-op
+   - `VmExit::Cancelled()` is returned when `EINTR` is received
+
+4. **Loop Termination**: The signal loop terminates when:
+   - The vCPU is no longer running (`running=false`), OR
+   - Cancellation is no longer requested (`cancel=false`)
+   - See the loop termination proof in the source code for a rigorous correctness analysis
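+
+From the host's perspective, the typical pattern is to hand the `InterruptHandle` to a watchdog thread. The sketch below is illustrative usage, not code from this repository; it assumes a `MultiUseSandbox` value named `sandbox` whose guest-call and handle accessors look as shown:
+
+```rust
+use std::thread;
+use std::time::Duration;
+
+// Obtain the handle before starting the guest call (illustrative API names).
+let handle = sandbox.interrupt_handle();
+
+// Watchdog: give the guest five seconds, then request cancellation.
+// kill() returns false if no guest call was running anymore.
+let watchdog = thread::spawn(move || {
+    thread::sleep(Duration::from_secs(5));
+    handle.kill()
+});
+
+// If the guest is still running when kill() lands, the call fails with
+// ExecutionCanceledByHost. The CANCEL_BIT cannot leak into the next call,
+// because clear_cancel() runs at the start of every guest function call.
+let result = sandbox.call::<i32>("Spin", ());
+let delivered = watchdog.join().unwrap();
+```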
+
+## Memory Ordering Guarantees
+
+Hyperlight uses Release-Acquire semantics to ensure correctness across threads:
+
+```mermaid
+graph TB
+    subgraph "vCPU Thread"
+        A[set_tid<br>Store tid with Release]
+        B[set_running<br>fetch_update RUNNING_BIT<br>with Release]
+        C[is_cancelled<br>Load with Acquire]
+        D[clear_running<br>fetch_and with Release]
+        J[is_debug_interrupted<br>Load with Acquire]
+    end
+
+    subgraph "Interrupt Thread"
+        E[kill<br>fetch_or CANCEL_BIT<br>with Release]
+        F[send_signal<br>Load running with Acquire]
+        G[Load tid with Acquire]
+        H[pthread_kill]
+        I[kill_from_debugger<br>fetch_or DEBUG_INTERRUPT_BIT<br>with Release]
+    end
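+
+The CANCEL_BIT handshake in miniature: a self-contained example of the same Release/Acquire pattern (plain standard-library code, not Hyperlight code):
+
+```rust
+use std::sync::Arc;
+use std::sync::atomic::{AtomicU8, Ordering};
+use std::thread;
+
+const CANCEL_BIT: u8 = 1 << 0;
+
+fn main() {
+    let state = Arc::new(AtomicU8::new(0));
+    let s = state.clone();
+
+    let killer = thread::spawn(move || {
+        // Everything this thread wrote before the fetch_or...
+        s.fetch_or(CANCEL_BIT, Ordering::Release);
+    });
+
+    // ...is guaranteed to be visible here once the Acquire load observes
+    // CANCEL_BIT (the kill() -> is_cancelled() synchronizes-with edge).
+    while state.load(Ordering::Acquire) & CANCEL_BIT == 0 {
+        std::hint::spin_loop();
+    }
+    println!("cancellation observed");
+
+    killer.join().unwrap();
+}
+```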
+
+    B -->|Synchronizes-with| F
+    A -->|Happens-before via B→F| G
+    E -->|Synchronizes-with| C
+    D -->|Synchronizes-with| F
+    I -->|Synchronizes-with| J
+```
+
+### Ordering Rules
+
+1. **tid Store → running Load**: `set_tid` (Release) synchronizes with `send_signal` (Acquire), ensuring the interrupt thread sees the correct thread ID.
+2. **CANCEL_BIT**: `kill` (Release) synchronizes with `is_cancelled` (Acquire), ensuring the vCPU sees the cancellation request.
+3. **clear_running**: `clear_running` (Release) synchronizes with `send_signal` (Acquire), ensuring the interrupt thread stops sending signals when the vCPU stops.
+4. **clear_cancel**: Uses Release to ensure operations from the previous run are visible to other threads.
+5. **dropped flag**: `set_dropped` (Release) synchronizes with `dropped` (Acquire), ensuring cleanup visibility.
+6. **debug_interrupt**: `kill_from_debugger` (Release) synchronizes with `is_debug_interrupted` (Acquire), ensuring the vCPU sees the debug interrupt request.
+
+## Interaction with Host Function Calls
+
+When a guest performs a host function call, the vCPU exits and `RUNNING_BIT` is cleared. `CANCEL_BIT` persists, so if `kill()` is called during the host call, the cancellation is detected when the guest attempts to resume.
+
+## Signal Behavior Across Loop Iterations
+
+When the run loop iterates (e.g., for host calls):
+
+1. `clear_running()` sets `running=false`, causing any active `send_signal()` loop to exit.
+2. `set_running()` sets `running=true` again.
+3. `is_cancelled()` detects the persistent `cancel` flag and the loop returns early.
+
+## Race Conditions
+
+1. **kill() between calls**: `clear_cancel()` at Timing Point 1 ensures `kill()` requests from before the current call are ignored.
+2. **kill() before run_vcpu()**: Signals interrupt the guest immediately.
+3. **Guest completes before signal**: If the guest finishes naturally, the signal is ignored or causes a retry in the next iteration (handled as stale).
+4. **Stale signals**: If a signal from a previous call arrives during a new call, `cancel_requested` (checked at Timing Point 5) will be false, causing a retry.
+5. **ABA Problem**: Clearing `CANCEL_BIT` at the start of each guest function call breaks any ongoing `send_signal()` loop left over from a previous call.
+
+## Windows Platform Differences
+
+While the core cancellation mechanism follows the same conceptual model on Windows, there are several platform-specific differences in implementation:
+
+### WindowsInterruptHandle Structure
+
+The `WindowsInterruptHandle` uses a simpler structure than its Linux counterpart:
+
+- **state (AtomicU8)**: Packs three bits (RUNNING_BIT, CANCEL_BIT, and DEBUG_INTERRUPT_BIT)
+- **partition_handle**: Windows Hyper-V partition handle for the VM
+- **dropped (AtomicBool)**: Set when the corresponding VM has been dropped
+
+**Key difference**: no `tid` field is needed because Windows doesn't use thread-targeted signals, and the `retry_delay` and `sig_rt_min_offset` fields are not needed either.
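+
+In code, the layout is roughly the following (a paraphrase of the construction site visible in the `hyperv_windows.rs` hunk below; the real definition may carry additional attributes):
+
+```rust
+#[derive(Debug)]
+pub(crate) struct WindowsInterruptHandle {
+    state: AtomicU8,                        // CANCEL_BIT | RUNNING_BIT | DEBUG_INTERRUPT_BIT
+    partition_handle: WHV_PARTITION_HANDLE, // target for WHvCancelRunVirtualProcessor
+    dropped: AtomicBool,                    // set when the VM is dropped
+}
+```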
+
+### Kill Operation Differences
+
+On Windows, the `kill()` method uses the Windows Hypervisor Platform (WHP) API `WHvCancelRunVirtualProcessor` instead of POSIX signals to interrupt the vCPU.
+
+**Key differences**:
+1. **No signal loop**: Windows calls `WHvCancelRunVirtualProcessor()` at most once in `kill()`, without needing retries
+
+### Why Linux Needs a Retry Loop but Windows Doesn't
+
+The fundamental difference between the platforms lies in how cancellation interacts with the hypervisor:
+
+**Linux (KVM/mshv3)**: POSIX signals can only interrupt the vCPU while the thread is executing kernel code (specifically, during the `ioctl` syscall that runs the vCPU). There is a narrow timing window between when the signal is sent and when the vCPU enters guest mode: a signal that arrives before the thread enters guest mode is delivered (harmlessly) but does not interrupt the subsequent guest execution. This is why signals must be sent repeatedly, with a delay between attempts, until either:
+- The vCPU exits (and consequently RUNNING_BIT becomes false), or
+- The cancellation is cleared (CANCEL_BIT becomes false)
+
+**Windows (WHP)**: The `WHvCancelRunVirtualProcessor()` API sets an internal `CancelPending` flag in the Windows Hypervisor Platform. This flag is:
+- Set immediately by the API call
+- Checked at the start of each VM run loop iteration (before entering guest mode)
+- Automatically cleared when it causes a `WHvRunVpExitReasonCanceled` exit
+
+This means if `WHvCancelRunVirtualProcessor()` is called:
+- **While the vCPU is running**: The API signals the hypervisor to exit with `WHvRunVpExitReasonCanceled`
+- **Before the VM runs**: The `CancelPending` flag persists and causes an immediate cancellation on the next VM run attempt
+
+Therefore, we only call `WHvCancelRunVirtualProcessor()` after checking that `RUNNING_BIT` is set. This is important because:
+1. If called when not running, the API would still succeed and would unconditionally cancel the next run attempt. This is undesirable, since `kill()` should have no effect if the vCPU is not running
+2. This makes the InterruptHandle's `CANCEL_BIT` (which is cleared at the start of each guest function call) the source of truth for whether cancellation is intended for the current call
+
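+A sketch of a Windows `kill()` consistent with these two rules (simplified; the real method also handles the gdb path, and this is not the verbatim implementation):
+
+```rust
+fn kill(&self) -> bool {
+    // Mark the cancellation first; Release pairs with the vCPU thread's
+    // Acquire load in is_cancelled().
+    self.state.fetch_or(CANCEL_BIT, Ordering::Release);
+
+    // Only poke the hypervisor if the vCPU is actually in guest mode.
+    // Otherwise WHvCancelRunVirtualProcessor would leave CancelPending set
+    // and spuriously cancel the *next* run attempt.
+    let running = self.state.load(Ordering::Acquire) & RUNNING_BIT != 0;
+    running && unsafe { WHvCancelRunVirtualProcessor(self.partition_handle, 0, 0).is_ok() }
+}
+```
+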
diff --git a/src/hyperlight_host/src/hypervisor/hyperv_linux.rs b/src/hyperlight_host/src/hypervisor/hyperv_linux.rs index 64074b5f9..7649aae2e 100644 --- a/src/hyperlight_host/src/hypervisor/hyperv_linux.rs +++ b/src/hyperlight_host/src/hypervisor/hyperv_linux.rs @@ -18,7 +18,7 @@ extern crate mshv_bindings; extern crate mshv_ioctls; use std::fmt::{Debug, Formatter}; -use std::sync::atomic::{AtomicBool, AtomicU64, Ordering}; +use std::sync::atomic::{AtomicBool, AtomicU8, AtomicU64}; use std::sync::{Arc, Mutex}; use log::{LevelFilter, error}; @@ -51,8 +51,8 @@ use super::gdb::{ use super::{HyperlightExit, Hypervisor, LinuxInterruptHandle, VirtualCPU}; #[cfg(gdb)] use crate::HyperlightError; -use crate::hypervisor::get_memory_access_violation; use crate::hypervisor::regs::CommonFpu; +use crate::hypervisor::{InterruptHandle, InterruptHandleImpl, get_memory_access_violation}; use crate::mem::memory_region::{MemoryRegion, MemoryRegionFlags}; use crate::mem::mgr::SandboxMemoryManager; use crate::mem::ptr::{GuestPtr, RawPtr}; @@ -273,7 +273,7 @@ pub(crate) struct HypervLinuxDriver { vcpu_fd: VcpuFd, orig_rsp: GuestPtr, entrypoint: u64, - interrupt_handle: Arc, + interrupt_handle: Arc, mem_mgr: Option>, host_funcs: Option>>, @@ -374,12 +374,8 @@ impl HypervLinuxDriver { vm_fd.map_user_memory(mshv_region) })?; - let interrupt_handle = Arc::new(LinuxInterruptHandle { - running: AtomicU64::new(0), - cancel_requested: AtomicU64::new(0), - call_active: AtomicBool::new(false), - #[cfg(gdb)] - debug_interrupt: AtomicBool::new(false), + let interrupt_handle: Arc = Arc::new(LinuxInterruptHandle { + state: AtomicU8::new(0), #[cfg(all( target_arch = "x86_64", target_vendor = "unknown", @@ -500,8 +496,11 @@ impl Hypervisor for HypervLinuxDriver { }; self.vcpu_fd.set_regs(&regs)?; + let interrupt_handle = self.interrupt_handle.clone(); + VirtualCPU::run( self.as_mut_hypervisor(), + interrupt_handle, #[cfg(gdb)] dbg_mem_access_fn, ) @@ -560,14 +559,15 @@ impl Hypervisor for HypervLinuxDriver { // reset fpu state self.set_fpu(&CommonFpu::default())?; + let interrupt_handle = self.interrupt_handle.clone(); + // run VirtualCPU::run( self.as_mut_hypervisor(), + interrupt_handle, #[cfg(gdb)] dbg_mem_access_fn, - )?; - - Ok(()) + ) } #[instrument(err(Debug), skip_all, parent = Span::current(), level = "Trace")] @@ -643,76 +643,11 @@ impl Hypervisor for HypervLinuxDriver { #[cfg(gdb)] const EXCEPTION_INTERCEPT: hv_message_type = hv_message_type_HVMSG_X64_EXCEPTION_INTERCEPT; - self.interrupt_handle - .tid - .store(unsafe { libc::pthread_self() as u64 }, Ordering::Release); - // Note: if `InterruptHandle::kill()` is called while this thread is **here** - // Cast to internal trait for access to internal methods - let interrupt_handle_internal = - self.interrupt_handle.as_ref() as &dyn super::InterruptHandleInternal; - - // (after set_running_bit but before checking cancel_requested): - // - kill() will stamp cancel_requested with the current generation - // - We will check cancel_requested below and skip the VcpuFd::run() call - // - This is the desired behavior - the kill takes effect immediately - let generation = interrupt_handle_internal.set_running_bit(); - - #[cfg(not(gdb))] - let debug_interrupt = false; - #[cfg(gdb)] - let debug_interrupt = self - .interrupt_handle - .debug_interrupt - .load(Ordering::Relaxed); - - // Don't run the
vcpu if `cancel_requested` is set for our generation - // - // Note: if `InterruptHandle::kill()` is called while this thread is **here** - // (after checking cancel_requested but before vcpu.run()): - // - kill() will stamp cancel_requested with the current generation - // - We will proceed with vcpu.run(), but signals will be sent to interrupt it - // - The vcpu will be interrupted and return EINTR (handled below) - let exit_reason = if interrupt_handle_internal - .is_cancel_requested_for_generation(generation) - || debug_interrupt - { - Err(mshv_ioctls::MshvError::from(libc::EINTR)) - } else { - #[cfg(feature = "trace_guest")] - tc.setup_guest_trace(Span::current().context()); - - // Note: if a `InterruptHandle::kill()` called while this thread is **here** - // Then the vcpu will run, but we will keep sending signals to this thread - // to interrupt it until `running` is set to false. The `vcpu_fd::run()` call will - // return either normally with an exit reason, or from being "kicked" by out signal handler, with an EINTR error, - // both of which are fine. - self.vcpu_fd.run() - }; - // Note: if `InterruptHandle::kill()` is called while this thread is **here** - // (after vcpu.run() returns but before clear_running_bit): - // - kill() continues sending signals to this thread (running bit is still set) - // - The signals are harmless (no-op handler), we just need to check cancel_requested - // - We load cancel_requested below to determine if this run was cancelled - let cancel_requested = - interrupt_handle_internal.is_cancel_requested_for_generation(generation); - #[cfg(gdb)] - let debug_interrupt = self - .interrupt_handle - .debug_interrupt - .load(Ordering::Relaxed); - // Note: if `InterruptHandle::kill()` is called while this thread is **here** - // (after loading cancel_requested but before clear_running_bit): - // - kill() stamps cancel_requested with the CURRENT generation (not the one we just loaded) - // - kill() continues sending signals until running bit is cleared - // - The newly stamped cancel_requested will affect the NEXT vcpu.run() call - // - Signals sent now are harmless (no-op handler) - interrupt_handle_internal.clear_running_bit(); - // At this point, running bit is clear so kill() will stop sending signals. - // However, we may still receive delayed signals that were sent before clear_running_bit. - // These stale signals are harmless because: - // - The signal handler is a no-op - // - We check generation matching in cancel_requested before treating EINTR as cancellation - // - If generation doesn't match, we return Retry instead of Cancelled + #[cfg(feature = "trace_guest")] + tc.setup_guest_trace(Span::current().context()); + + let exit_reason = self.vcpu_fd.run(); + let result = match exit_reason { Ok(m) => match m.header.message_type { HALT_MESSAGE => { @@ -793,35 +728,7 @@ impl Hypervisor for HypervLinuxDriver { }, Err(e) => match e.errno() { // We send a signal (SIGRTMIN+offset) to interrupt the vcpu, which causes EINTR - libc::EINTR => { - // Check if cancellation was requested for THIS specific generation. - // If not, the EINTR came from: - // - A debug interrupt (if GDB is enabled) - // - A stale signal from a previous guest call (generation mismatch) - // - A signal meant for a different sandbox on the same thread - // In these cases, we return Retry to continue execution. 
- if cancel_requested { - interrupt_handle_internal.clear_cancel_requested(); - HyperlightExit::Cancelled() - } else { - #[cfg(gdb)] - if debug_interrupt { - self.interrupt_handle - .debug_interrupt - .store(false, Ordering::Relaxed); - - // If the vCPU was stopped because of an interrupt, we need to - // return a special exit reason so that the gdb thread can handle it - // and resume execution - HyperlightExit::Debug(VcpuStopReason::Interrupt) - } else { - HyperlightExit::Retry() - } - - #[cfg(not(gdb))] - HyperlightExit::Retry() - } - } + libc::EINTR => HyperlightExit::Cancelled(), libc::EAGAIN => HyperlightExit::Retry(), _ => { crate::debug!("mshv Error - Details: Error: {} \n {:#?}", e, &self); @@ -870,10 +777,14 @@ impl Hypervisor for HypervLinuxDriver { self as &mut dyn Hypervisor } - fn interrupt_handle(&self) -> Arc { + fn interrupt_handle(&self) -> Arc { self.interrupt_handle.clone() } + fn clear_cancel(&self) { + self.interrupt_handle.clear_cancel(); + } + #[cfg(crashdump)] fn crashdump_context(&self) -> Result> { if self.rt_cfg.guest_core_dump { @@ -1102,7 +1013,7 @@ impl Hypervisor for HypervLinuxDriver { impl Drop for HypervLinuxDriver { #[instrument(skip_all, parent = Span::current(), level = "Trace")] fn drop(&mut self) { - self.interrupt_handle.dropped.store(true, Ordering::Relaxed); + self.interrupt_handle.set_dropped(); for region in self.sandbox_regions.iter().chain(self.mmap_regions.iter()) { let mshv_region: mshv_user_mem_region = region.to_owned().into(); match self.vm_fd.unmap_user_memory(mshv_region) { diff --git a/src/hyperlight_host/src/hypervisor/hyperv_windows.rs b/src/hyperlight_host/src/hypervisor/hyperv_windows.rs index ca1f75997..1be0ea502 100644 --- a/src/hyperlight_host/src/hypervisor/hyperv_windows.rs +++ b/src/hyperlight_host/src/hypervisor/hyperv_windows.rs @@ -17,17 +17,14 @@ limitations under the License. 
use std::fmt; use std::fmt::{Debug, Formatter}; use std::string::String; -use std::sync::atomic::{AtomicBool, AtomicU64, Ordering}; +use std::sync::atomic::{AtomicBool, AtomicU8}; use std::sync::{Arc, Mutex}; use log::LevelFilter; use tracing::{Span, instrument}; #[cfg(feature = "trace_guest")] use tracing_opentelemetry::OpenTelemetrySpanExt; -use windows::Win32::System::Hypervisor::{ - WHV_MEMORY_ACCESS_TYPE, WHV_PARTITION_HANDLE, WHV_RUN_VP_EXIT_CONTEXT, WHV_RUN_VP_EXIT_REASON, - WHvCancelRunVirtualProcessor, -}; +use windows::Win32::System::Hypervisor::{WHV_MEMORY_ACCESS_TYPE, WHV_RUN_VP_EXIT_REASON}; #[cfg(crashdump)] use {super::crashdump, std::path::Path}; #[cfg(gdb)] @@ -46,7 +43,7 @@ use super::windows_hypervisor_platform::{VMPartition, VMProcessor}; use super::wrappers::HandleWrapper; use super::{HyperlightExit, Hypervisor, InterruptHandle, VirtualCPU}; use crate::hypervisor::regs::{CommonFpu, CommonRegisters}; -use crate::hypervisor::{InterruptHandleInternal, get_memory_access_violation}; +use crate::hypervisor::{InterruptHandleImpl, WindowsInterruptHandle, get_memory_access_violation}; use crate::mem::memory_region::{MemoryRegion, MemoryRegionFlags}; use crate::mem::mgr::SandboxMemoryManager; use crate::mem::ptr::{GuestPtr, RawPtr}; @@ -263,7 +260,7 @@ pub(crate) struct HypervWindowsDriver { _surrogate_process: SurrogateProcess, // we need to keep a reference to the SurrogateProcess for the duration of the driver since otherwise it will dropped and the memory mapping will be unmapped and the surrogate process will be returned to the pool entrypoint: u64, orig_rsp: GuestPtr, - interrupt_handle: Arc, + interrupt_handle: Arc, mem_mgr: Option>, host_funcs: Option>>, @@ -327,11 +324,7 @@ impl HypervWindowsDriver { }; let interrupt_handle = Arc::new(WindowsInterruptHandle { - running: AtomicU64::new(0), - cancel_requested: AtomicU64::new(0), - #[cfg(gdb)] - debug_interrupt: AtomicBool::new(false), - call_active: AtomicBool::new(false), + state: AtomicU8::new(0), partition_handle, dropped: AtomicBool::new(false), }); @@ -443,8 +436,10 @@ impl Hypervisor for HypervWindowsDriver { }; self.set_regs(®s)?; + let interrupt_handle = self.interrupt_handle.clone(); VirtualCPU::run( self.as_mut_hypervisor(), + interrupt_handle, #[cfg(gdb)] dbg_mem_access_hdl, ) @@ -482,13 +477,13 @@ impl Hypervisor for HypervWindowsDriver { // reset fpu state self.processor.set_fpu(&CommonFpu::default())?; + let interrupt_handle = self.interrupt_handle.clone(); VirtualCPU::run( self.as_mut_hypervisor(), + interrupt_handle, #[cfg(gdb)] dbg_mem_access_hdl, - )?; - - Ok(()) + ) } #[instrument(err(Debug), skip_all, parent = Span::current(), level = "Trace")] @@ -550,58 +545,10 @@ impl Hypervisor for HypervWindowsDriver { &mut self, #[cfg(feature = "trace_guest")] tc: &mut crate::sandbox::trace::TraceContext, ) -> Result { - // Cast to internal trait for access to internal methods - let interrupt_handle_internal = - self.interrupt_handle.as_ref() as &dyn super::InterruptHandleInternal; - - // Get current generation and set running bit - let generation = interrupt_handle_internal.set_running_bit(); + #[cfg(feature = "trace_guest")] + tc.setup_guest_trace(Span::current().context()); - #[cfg(not(gdb))] - let debug_interrupt = false; - #[cfg(gdb)] - let debug_interrupt = self - .interrupt_handle - .debug_interrupt - .load(Ordering::Relaxed); - - // Check if cancellation was requested for THIS generation - let exit_context = if interrupt_handle_internal - .is_cancel_requested_for_generation(generation) - || 
debug_interrupt - { - WHV_RUN_VP_EXIT_CONTEXT { - ExitReason: WHV_RUN_VP_EXIT_REASON(8193i32), // WHvRunVpExitReasonCanceled - VpContext: Default::default(), - Anonymous: Default::default(), - Reserved: Default::default(), - } - } else { - #[cfg(feature = "trace_guest")] - tc.setup_guest_trace(Span::current().context()); - - self.processor.run()? - }; - - // Clear running bit - interrupt_handle_internal.clear_running_bit(); - - let is_canceled = exit_context.ExitReason == WHV_RUN_VP_EXIT_REASON(8193i32); // WHvRunVpExitReasonCanceled - - // Check if this was a manual cancellation (vs internal Windows cancellation) - let cancel_was_requested_manually = - interrupt_handle_internal.is_cancel_requested_for_generation(generation); - - // Only clear cancel_requested if we're actually processing a cancellation for this generation - if is_canceled && cancel_was_requested_manually { - interrupt_handle_internal.clear_cancel_requested(); - } - - #[cfg(gdb)] - let debug_interrupt = self - .interrupt_handle - .debug_interrupt - .load(Ordering::Relaxed); + let exit_context = self.processor.run()?; let result = match exit_context.ExitReason { // WHvRunVpExitReasonX64IoPortAccess @@ -658,45 +605,10 @@ impl Hypervisor for HypervWindowsDriver { } // WHvRunVpExitReasonCanceled // Execution was cancelled by the host. - // This will happen when guest code runs for too long WHV_RUN_VP_EXIT_REASON(8193i32) => { debug!("HyperV Cancelled Details :\n {:#?}", &self); - #[cfg(gdb)] - if debug_interrupt { - self.interrupt_handle - .debug_interrupt - .store(false, Ordering::Relaxed); - - // If the vCPU was stopped because of an interrupt, we need to - // return a special exit reason so that the gdb thread can handle it - // and resume execution - HyperlightExit::Debug(VcpuStopReason::Interrupt) - } else if !cancel_was_requested_manually { - // This was an internal cancellation - // The virtualization stack can use this function to return the control - // of a virtual processor back to the virtualization stack in case it - // needs to change the state of a VM or to inject an event into the processor - // see https://learn.microsoft.com/en-us/virtualization/api/hypervisor-platform/funcs/whvcancelrunvirtualprocessor#remarks - debug!("Internal cancellation detected, returning Retry error"); - HyperlightExit::Retry() - } else { - HyperlightExit::Cancelled() - } - #[cfg(not(gdb))] - { - if !cancel_was_requested_manually { - // This was an internal cancellation - // The virtualization stack can use this function to return the control - // of a virtual processor back to the virtualization stack in case it - // needs to change the state of a VM or to inject an event into the processor - // see https://learn.microsoft.com/en-us/virtualization/api/hypervisor-platform/funcs/whvcancelrunvirtualprocessor#remarks - debug!("Internal cancellation detected, returning Retry error"); - HyperlightExit::Retry() - } else { - HyperlightExit::Cancelled() - } - } + HyperlightExit::Cancelled() } #[cfg(gdb)] WHV_RUN_VP_EXIT_REASON(4098i32) => { @@ -754,10 +666,14 @@ impl Hypervisor for HypervWindowsDriver { self.processor.set_sregs(sregs) } - fn interrupt_handle(&self) -> Arc { + fn interrupt_handle(&self) -> Arc { self.interrupt_handle.clone() } + fn clear_cancel(&self) { + self.interrupt_handle.clear_cancel(); + } + #[instrument(skip_all, parent = Span::current(), level = "Trace")] fn as_mut_hypervisor(&mut self) -> &mut dyn Hypervisor { self as &mut dyn Hypervisor @@ -990,86 +906,6 @@ impl Hypervisor for HypervWindowsDriver { impl Drop for 
HypervWindowsDriver { fn drop(&mut self) { - self.interrupt_handle.dropped.store(true, Ordering::Relaxed); - } -} - -#[derive(Debug)] -pub struct WindowsInterruptHandle { - /// Combined running flag (bit 63) and generation counter (bits 0-62). - /// - /// The generation increments with each guest function call to prevent - /// stale cancellations from affecting new calls (ABA problem). - /// - /// Layout: `[running:1 bit][generation:63 bits]` - running: AtomicU64, - - /// Combined cancel_requested flag (bit 63) and generation counter (bits 0-62). - /// - /// When kill() is called, this stores the current generation along with - /// the cancellation flag. The VCPU only honors the cancellation if the - /// generation matches its current generation. - /// - /// Layout: `[cancel_requested:1 bit][generation:63 bits]` - cancel_requested: AtomicU64, - - // This is used to signal the GDB thread to stop the vCPU - #[cfg(gdb)] - debug_interrupt: AtomicBool, - /// Flag indicating whether a guest function call is currently in progress. - /// - /// **true**: A guest function call is active (between call start and completion) - /// **false**: No guest function call is active - /// - /// # Purpose - /// - /// This flag prevents kill() from having any effect when called outside of a - /// guest function call. This solves the "kill-in-advance" problem where kill() - /// could be called before a guest function starts and would incorrectly cancel it. - call_active: AtomicBool, - partition_handle: WHV_PARTITION_HANDLE, - dropped: AtomicBool, -} - -impl InterruptHandle for WindowsInterruptHandle { - fn kill(&self) -> bool { - // Check if a call is actually active first - if !self.call_active.load(Ordering::Acquire) { - return false; - } - - // Get the current running state and generation - let (running, generation) = self.get_running_and_generation(); - - // Set cancel_requested with the current generation - self.set_cancel_requested(generation); - - // Only call WHvCancelRunVirtualProcessor if VCPU is actually running in guest mode - running && unsafe { WHvCancelRunVirtualProcessor(self.partition_handle, 0, 0).is_ok() } - } - - #[cfg(gdb)] - fn kill_from_debugger(&self) -> bool { - self.debug_interrupt.store(true, Ordering::Relaxed); - let (running, _) = self.get_running_and_generation(); - running && unsafe { WHvCancelRunVirtualProcessor(self.partition_handle, 0, 0).is_ok() } - } - - fn dropped(&self) -> bool { - self.dropped.load(Ordering::Relaxed) - } -} - -impl InterruptHandleInternal for WindowsInterruptHandle { - fn get_call_active(&self) -> &AtomicBool { - &self.call_active - } - - fn get_running(&self) -> &AtomicU64 { - &self.running - } - - fn get_cancel_requested(&self) -> &AtomicU64 { - &self.cancel_requested + self.interrupt_handle.set_dropped(); } } diff --git a/src/hyperlight_host/src/hypervisor/kvm.rs b/src/hyperlight_host/src/hypervisor/kvm.rs index 04b8ed60f..032468d2d 100644 --- a/src/hyperlight_host/src/hypervisor/kvm.rs +++ b/src/hyperlight_host/src/hypervisor/kvm.rs @@ -15,7 +15,7 @@ limitations under the License. 
*/ use std::fmt::Debug; -use std::sync::atomic::{AtomicBool, AtomicU64, Ordering}; +use std::sync::atomic::{AtomicBool, AtomicU8, AtomicU64}; use std::sync::{Arc, Mutex}; use kvm_bindings::{kvm_fpu, kvm_regs, kvm_sregs, kvm_userspace_memory_region}; @@ -36,8 +36,8 @@ use super::gdb::{ use super::{HyperlightExit, Hypervisor, LinuxInterruptHandle, VirtualCPU}; #[cfg(gdb)] use crate::HyperlightError; -use crate::hypervisor::get_memory_access_violation; use crate::hypervisor::regs::{CommonFpu, CommonRegisters}; +use crate::hypervisor::{InterruptHandle, InterruptHandleImpl, get_memory_access_violation}; use crate::mem::memory_region::{MemoryRegion, MemoryRegionFlags}; use crate::mem::mgr::SandboxMemoryManager; use crate::mem::ptr::{GuestPtr, RawPtr}; @@ -271,7 +271,7 @@ pub(crate) struct KVMDriver { vcpu_fd: VcpuFd, entrypoint: u64, orig_rsp: GuestPtr, - interrupt_handle: Arc, + interrupt_handle: Arc, mem_mgr: Option>, host_funcs: Option>>, @@ -332,12 +332,8 @@ impl KVMDriver { let rsp_gp = GuestPtr::try_from(RawPtr::from(rsp))?; - let interrupt_handle = Arc::new(LinuxInterruptHandle { - running: AtomicU64::new(0), - cancel_requested: AtomicU64::new(0), - call_active: AtomicBool::new(false), - #[cfg(gdb)] - debug_interrupt: AtomicBool::new(false), + let interrupt_handle: Arc = Arc::new(LinuxInterruptHandle { + state: AtomicU8::new(0), #[cfg(all( target_arch = "x86_64", target_vendor = "unknown", @@ -353,8 +349,8 @@ impl KVMDriver { )))] tid: AtomicU64::new(unsafe { libc::pthread_self() }), retry_delay: config.get_interrupt_retry_delay(), - dropped: AtomicBool::new(false), sig_rt_min_offset: config.get_interrupt_vcpu_sigrtmin_offset(), + dropped: AtomicBool::new(false), }); let mut kvm = Self { @@ -460,8 +456,11 @@ impl Hypervisor for KVMDriver { }; self.set_regs(®s)?; + let interrupt_handle = self.interrupt_handle.clone(); + VirtualCPU::run( self.as_mut_hypervisor(), + interrupt_handle, #[cfg(gdb)] dbg_mem_access_fn, ) @@ -543,9 +542,12 @@ impl Hypervisor for KVMDriver { // reset fpu state self.set_fpu(&CommonFpu::default())?; + let interrupt_handle = self.interrupt_handle.clone(); + // run VirtualCPU::run( self.as_mut_hypervisor(), + interrupt_handle, #[cfg(gdb)] dbg_mem_access_fn, )?; @@ -619,76 +621,10 @@ impl Hypervisor for KVMDriver { &mut self, #[cfg(feature = "trace_guest")] tc: &mut crate::sandbox::trace::TraceContext, ) -> Result { - self.interrupt_handle - .tid - .store(unsafe { libc::pthread_self() as u64 }, Ordering::Release); - // Note: if `InterruptHandle::kill()` is called while this thread is **here** - // Cast to internal trait for access to internal methods - let interrupt_handle_internal = - self.interrupt_handle.as_ref() as &dyn super::InterruptHandleInternal; - - // (after set_running_bit but before checking cancel_requested): - // - kill() will stamp cancel_requested with the current generation - // - We will check cancel_requested below and skip the VcpuFd::run() call - // - This is the desired behavior - the kill takes effect immediately - let generation = interrupt_handle_internal.set_running_bit(); - - #[cfg(not(gdb))] - let debug_interrupt = false; - #[cfg(gdb)] - let debug_interrupt = self - .interrupt_handle - .debug_interrupt - .load(Ordering::Relaxed); - // Don't run the vcpu if `cancel_requested` is set for our generation - // - // Note: if `InterruptHandle::kill()` is called while this thread is **here** - // (after checking cancel_requested but before vcpu.run()): - // - kill() will stamp cancel_requested with the current generation - // - We will proceed with 
vcpu.run(), but signals will be sent to interrupt it - // - The vcpu will be interrupted and return EINTR (handled below) - let exit_reason = if interrupt_handle_internal - .is_cancel_requested_for_generation(generation) - || debug_interrupt - { - Err(kvm_ioctls::Error::new(libc::EINTR)) - } else { - #[cfg(feature = "trace_guest")] - tc.setup_guest_trace(Span::current().context()); - - // Note: if `InterruptHandle::kill()` is called while this thread is **here** - // (during vcpu.run() execution): - // - kill() stamps cancel_requested with the current generation - // - kill() sends signals (SIGRTMIN+offset) to this thread repeatedly - // - The signal handler is a no-op, but it causes vcpu.run() to return EINTR - // - We check cancel_requested below and return Cancelled if generation matches - self.vcpu_fd.run() - }; - // Note: if `InterruptHandle::kill()` is called while this thread is **here** - // (after vcpu.run() returns but before clear_running_bit): - // - kill() continues sending signals to this thread (running bit is still set) - // - The signals are harmless (no-op handler), we just need to check cancel_requested - // - We load cancel_requested below to determine if this run was cancelled - let cancel_requested = - interrupt_handle_internal.is_cancel_requested_for_generation(generation); - #[cfg(gdb)] - let debug_interrupt = self - .interrupt_handle - .debug_interrupt - .load(Ordering::Relaxed); - // Note: if `InterruptHandle::kill()` is called while this thread is **here** - // (after loading cancel_requested but before clear_running_bit): - // - kill() stamps cancel_requested with the CURRENT generation (not the one we just loaded) - // - kill() continues sending signals until running bit is cleared - // - The newly stamped cancel_requested will affect the NEXT vcpu.run() call - // - Signals sent now are harmless (no-op handler) - interrupt_handle_internal.clear_running_bit(); - // At this point, running bit is clear so kill() will stop sending signals. - // However, we may still receive delayed signals that were sent before clear_running_bit. - // These stale signals are harmless because: - // - The signal handler is a no-op - // - We check generation matching in cancel_requested before treating EINTR as cancellation - // - If generation doesn't match, we return Retry instead of Cancelled + #[cfg(feature = "trace_guest")] + tc.setup_guest_trace(Span::current().context()); + + let exit_reason = self.vcpu_fd.run(); let result = match exit_reason { Ok(VcpuExit::Hlt) => { crate::debug!("KVM - Halt Details : {:#?}", &self); @@ -738,35 +674,7 @@ impl Hypervisor for KVMDriver { }, Err(e) => match e.errno() { // We send a signal (SIGRTMIN+offset) to interrupt the vcpu, which causes EINTR - libc::EINTR => { - // Check if cancellation was requested for THIS specific generation. - // If not, the EINTR came from: - // - A debug interrupt (if GDB is enabled) - // - A stale signal from a previous guest call (generation mismatch) - // - A signal meant for a different sandbox on the same thread - // In these cases, we return Retry to continue execution. 
- if cancel_requested { - interrupt_handle_internal.clear_cancel_requested(); - HyperlightExit::Cancelled() - } else { - #[cfg(gdb)] - if debug_interrupt { - self.interrupt_handle - .debug_interrupt - .store(false, Ordering::Relaxed); - - // If the vCPU was stopped because of an interrupt, we need to - // return a special exit reason so that the gdb thread can handle it - // and resume execution - HyperlightExit::Debug(VcpuStopReason::Interrupt) - } else { - HyperlightExit::Retry() - } - - #[cfg(not(gdb))] - HyperlightExit::Retry() - } - } + libc::EINTR => HyperlightExit::Cancelled(), libc::EAGAIN => HyperlightExit::Retry(), _ => { crate::debug!("KVM Error -Details: Address: {} \n {:#?}", e, &self); @@ -820,10 +728,14 @@ impl Hypervisor for KVMDriver { self as &mut dyn Hypervisor } - fn interrupt_handle(&self) -> Arc { + fn interrupt_handle(&self) -> Arc { self.interrupt_handle.clone() } + fn clear_cancel(&self) { + self.interrupt_handle.clear_cancel(); + } + #[cfg(crashdump)] fn crashdump_context(&self) -> Result> { if self.rt_cfg.guest_core_dump { @@ -1057,6 +969,6 @@ impl Hypervisor for KVMDriver { impl Drop for KVMDriver { fn drop(&mut self) { - self.interrupt_handle.dropped.store(true, Ordering::Relaxed); + self.interrupt_handle.set_dropped(); } } diff --git a/src/hyperlight_host/src/hypervisor/mod.rs b/src/hyperlight_host/src/hypervisor/mod.rs index f5575b989..7a4c31014 100644 --- a/src/hyperlight_host/src/hypervisor/mod.rs +++ b/src/hyperlight_host/src/hypervisor/mod.rs @@ -63,7 +63,9 @@ pub(crate) mod crashdump; use std::fmt::Debug; use std::str::FromStr; -use std::sync::atomic::{AtomicBool, AtomicU64, Ordering}; +#[cfg(any(kvm, mshv3))] +use std::sync::atomic::AtomicU64; +use std::sync::atomic::{AtomicBool, AtomicU8, Ordering}; use std::sync::{Arc, Mutex}; #[cfg(any(kvm, mshv3))] use std::time::Duration; @@ -178,7 +180,10 @@ pub(crate) trait Hypervisor: Debug + Send { ) -> Result; /// Get InterruptHandle to underlying VM (returns internal trait) - fn interrupt_handle(&self) -> Arc; + fn interrupt_handle(&self) -> Arc; + + /// Clear the cancellation flag + fn clear_cancel(&self); /// Get regs #[allow(dead_code)] @@ -356,6 +361,7 @@ impl VirtualCPU { #[instrument(err(Debug), skip_all, parent = Span::current(), level = "Trace")] pub(crate) fn run( hv: &mut dyn Hypervisor, + interrupt_handle: Arc, #[cfg(gdb)] dbg_mem_access_fn: Arc>>, ) -> Result<()> { // Keeps the trace context and open spans @@ -363,26 +369,63 @@ impl VirtualCPU { let mut tc = crate::sandbox::trace::TraceContext::new(); loop { - #[cfg(feature = "trace_guest")] - let result = { - let result = hv.run(&mut tc); - // End current host trace by closing the current span that captures traces - // happening when a guest exits and re-enters. - tc.end_host_trace(); - - // Handle the guest trace data if any - if let Err(e) = hv.handle_trace(&mut tc) { - // If no trace data is available, we just log a message and continue - // Is this the right thing to do? 
- log::debug!("Error handling guest trace: {:?}", e); - } + // ===== KILL() TIMING POINT 2: Before set_running() ===== + // If kill() is called and ran to completion BEFORE this line executes: + // - CANCEL_BIT will be set and we will return an early VmExit::Cancelled() + // without sending any signals/WHV api calls + #[cfg(any(kvm, mshv3))] + interrupt_handle.set_tid(); + interrupt_handle.set_running(); + // NOTE: `set_running()`` must be called before checking `is_cancelled()` + // otherwise we risk missing a call to `kill()` because the vcpu would not be marked as running yet so signals won't be sent + + let exit_reason = { + if interrupt_handle.is_cancelled() || interrupt_handle.is_debug_interrupted() { + Ok(HyperlightExit::Cancelled()) + } else { + // ==== KILL() TIMING POINT 3: Before calling run() ==== + // If kill() is called and ran to completion BEFORE this line executes: + // - Will still do a VM entry, but signals will be sent until VM exits + #[cfg(feature = "trace_guest")] + let result = hv.run(&mut tc); + #[cfg(not(feature = "trace_guest"))] + let result = hv.run(); + + // End current host trace by closing the current span that captures traces + // happening when a guest exits and re-enters. + #[cfg(feature = "trace_guest")] + tc.end_host_trace(); + + // Handle the guest trace data if any + #[cfg(feature = "trace_guest")] + if let Err(e) = hv.handle_trace(&mut tc) { + // If no trace data is available, we just log a message and continue + // Is this the right thing to do? + log::debug!("Error handling guest trace: {:?}", e); + } - result + result + } }; - #[cfg(not(feature = "trace_guest"))] - let result = hv.run(); - match result { + // ===== KILL() TIMING POINT 4: Before clear_running() ===== + // If kill() is called and ran to completion BEFORE this line executes: + // - CANCEL_BIT will be set. Cancellation is deferred to the next iteration. + // - Signals will be sent until `clear_running()` is called, which is ok + interrupt_handle.clear_running(); + + // ===== KILL() TIMING POINT 5: Before capturing cancel_requested ===== + // If kill() is called and ran to completion BEFORE this line executes: + // - CANCEL_BIT will be set. Cancellation is deferred to the next iteration. + // - Signals will not be sent + let cancel_requested = interrupt_handle.is_cancelled(); + let debug_interrupted = interrupt_handle.is_debug_interrupted(); + + // ===== KILL() TIMING POINT 6: Before checking exit_reason ===== + // If kill() is called and ran to completion BEFORE this line executes: + // - CANCEL_BIT will be set. Cancellation is deferred to the next iteration. 
+ // - Signals will not be sent + match exit_reason { #[cfg(gdb)] Ok(HyperlightExit::Debug(stop_reason)) => { if let Err(e) = hv.handle_debug(dbg_mem_access_fn.clone(), stop_reason) { @@ -425,6 +468,24 @@ impl VirtualCPU { )); } Ok(HyperlightExit::Cancelled()) => { + // If cancellation was not requested for this specific guest function call, + // the vcpu was interrupted by a stale cancellation from a previous call + if !cancel_requested && !debug_interrupted { + // treat this the same as a HyperlightExit::Retry, the cancel was not meant for this call + continue; + } + + // If the vcpu was interrupted by a debugger, we need to handle it + #[cfg(gdb)] + { + interrupt_handle.clear_debug_interrupt(); + if let Err(e) = + hv.handle_debug(dbg_mem_access_fn.clone(), VcpuStopReason::Interrupt) + { + log_then_return!(e); + } + } + // Shutdown is returned when the host has cancelled execution // After termination, the main thread will re-initialize the VM metrics::counter!(METRIC_GUEST_CANCELLATION).increment(1); @@ -461,54 +522,44 @@ impl VirtualCPU { } } -/// A trait for handling interrupts to a sandbox's vcpu (public API) -pub trait InterruptHandle: Debug + Send + Sync { +/// A trait for platform-specific interrupt handle implementation details +pub(crate) trait InterruptHandleImpl: InterruptHandle { + /// Set the thread ID for the vcpu thread + #[cfg(any(kvm, mshv3))] + fn set_tid(&self); + + /// Set the running state + fn set_running(&self); + + /// Clear the running state + fn clear_running(&self); + + /// Mark the handle as dropped + fn set_dropped(&self); + + /// Check if cancellation was requested + fn is_cancelled(&self) -> bool; + + /// Clear the cancellation request flag + fn clear_cancel(&self); + + /// Check if debug interrupt was requested (always returns false when gdb feature is disabled) + fn is_debug_interrupted(&self) -> bool; + + // Clear the debug interrupt request flag + #[cfg(gdb)] + fn clear_debug_interrupt(&self); +} + +/// A trait for handling interrupts to a sandbox's vcpu +pub trait InterruptHandle: Send + Sync + Debug { /// Interrupt the corresponding sandbox from running. /// - /// This method attempts to cancel a currently executing guest function call by sending - /// a signal to the VCPU thread. It uses generation tracking and call_active flag to - /// ensure the interruption is safe and precise. - /// - /// # Behavior - /// - /// - **Guest function running**: If called while a guest function is executing (VCPU running - /// or in a host function call), this stamps the current generation into cancel_requested - /// and sends a signal to interrupt the VCPU. Returns `true`. - /// - /// - **No active call**: If called when no guest function call is in progress (call_active=false), - /// this has no effect and returns `false`. This prevents "kill-in-advance" where kill() - /// is called before a guest function starts. - /// - /// - **During host function**: If the guest call is currently executing a host function - /// (VCPU not running but call_active=true), this stamps cancel_requested. When the - /// host function returns and attempts to re-enter the guest, the cancellation will - /// be detected and the call will abort. Returns `true`. - /// - /// # Generation Tracking - /// - /// The method stamps the current generation number along with the cancellation request. 
- /// This ensures that: - /// - Stale signals from previous calls are ignored (generation mismatch) - /// - Only the intended guest function call is affected - /// - Multiple rapid kill() calls on the same generation are idempotent - /// - /// # Blocking Behavior - /// - /// This function will block while attempting to deliver the signal to the VCPU thread, - /// retrying until either: - /// - The signal is successfully delivered (VCPU transitions from running to not running) - /// - The VCPU stops running for another reason (e.g., call completes normally) - /// - /// # Returns - /// - /// - `true`: Cancellation request was stamped (kill will take effect) - /// - `false`: No active call, cancellation request was not stamped (no effect) + /// - If this is called while the the sandbox currently executing a guest function call, it will interrupt the sandbox and return `true`. + /// - If this is called while the sandbox is not running (for example before or after calling a guest function), it will do nothing and return `false`. /// /// # Note - /// - /// To reliably interrupt a guest call, ensure `kill()` is called while the guest - /// function is actually executing. Calling kill() before call_guest_function() will - /// have no effect. + /// This function will block for the duration of the time it takes for the vcpu thread to be interrupted. fn kill(&self) -> bool; /// Used by a debugger to interrupt the corresponding sandbox from running. @@ -523,322 +574,82 @@ pub trait InterruptHandle: Debug + Send + Sync { #[cfg(gdb)] fn kill_from_debugger(&self) -> bool; - /// Check if the corresponding VM has been dropped. + /// Returns true if the corresponding sandbox has been dropped fn dropped(&self) -> bool; } -/// Internal trait for interrupt handle implementation details (private, cross-platform). -/// -/// This trait contains all the internal atomics access methods and helper functions -/// that are shared between Linux and Windows implementations. It extends InterruptHandle -/// to inherit the public API. -/// -/// This trait should NOT be used outside of hypervisor implementations. -pub(crate) trait InterruptHandleInternal: InterruptHandle { - /// Returns the call_active atomic bool reference for internal implementations. - fn get_call_active(&self) -> &AtomicBool; - - /// Returns the running atomic u64 reference for internal implementations. - fn get_running(&self) -> &AtomicU64; - - /// Returns the cancel_requested atomic u64 reference for internal implementations. - fn get_cancel_requested(&self) -> &AtomicU64; - - /// Set call_active - increments generation and sets flag. - /// - /// Increments the generation counter and sets the call_active flag to true, - /// indicating that a guest function call is now in progress. This allows - /// kill() to stamp cancel_requested with the correct generation. - /// - /// Must be called at the start of call_guest_function_by_name_no_reset(), - /// before any VCPU execution begins. - /// - /// Returns true if call_active was already set (indicating a guard already exists), - /// false otherwise. - fn set_call_active(&self) -> bool { - self.increment_generation(); - self.get_call_active().swap(true, Ordering::AcqRel) - } - - /// Clear call_active - clears the call_active flag. - /// - /// Clears the call_active flag, indicating that no guest function call is - /// in progress. After this, kill() will have no effect and will return false. 
- /// - /// Must be called at the end of call_guest_function_by_name_no_reset(), - /// after the guest call has fully completed (whether successfully or with error). - fn clear_call_active(&self) { - self.get_call_active().store(false, Ordering::Release) - } - - /// Set cancel_requested to true with the given generation. - /// - /// This stamps the cancellation request with the current generation number, - /// ensuring that only the VCPU running with this exact generation will honor - /// the cancellation. - fn set_cancel_requested(&self, generation: u64) { - const CANCEL_REQUESTED_BIT: u64 = 1 << 63; - const MAX_GENERATION: u64 = CANCEL_REQUESTED_BIT - 1; - let value = CANCEL_REQUESTED_BIT | (generation & MAX_GENERATION); - self.get_cancel_requested().store(value, Ordering::Release); - } - - /// Clear cancel_requested (reset to no cancellation). - /// - /// This is called after a cancellation has been processed to reset the - /// cancellation flag for the next guest call. - fn clear_cancel_requested(&self) { - self.get_cancel_requested().store(0, Ordering::Release); - } - - /// Check if cancel_requested is set for the given generation. - /// - /// Returns true only if BOTH: - /// - The cancellation flag is set - /// - The stored generation matches the provided generation - /// - /// This prevents stale cancellations from affecting new guest calls. - fn is_cancel_requested_for_generation(&self, generation: u64) -> bool { - const CANCEL_REQUESTED_BIT: u64 = 1 << 63; - const MAX_GENERATION: u64 = CANCEL_REQUESTED_BIT - 1; - let raw = self.get_cancel_requested().load(Ordering::Acquire); - let is_set = raw & CANCEL_REQUESTED_BIT != 0; - let stored_generation = raw & MAX_GENERATION; - is_set && stored_generation == generation - } - - /// Set running bit to true, return current generation. - /// - /// This is called when the VCPU is about to enter guest mode. It atomically - /// sets the running flag while preserving the generation counter. - fn set_running_bit(&self) -> u64 { - const RUNNING_BIT: u64 = 1 << 63; - self.get_running() - .fetch_update(Ordering::Release, Ordering::Acquire, |raw| { - Some(raw | RUNNING_BIT) - }) - .map(|raw| raw & !RUNNING_BIT) // Return the current generation - .unwrap_or(0) - } - - /// Increment the generation for a new guest function call. - /// - /// The generation counter wraps around at MAX_GENERATION (2^63 - 1). - /// This is called at the start of each new guest function call to provide - /// a unique identifier that prevents ABA problems with stale cancellations. - /// - /// Returns the NEW generation number (after incrementing). - fn increment_generation(&self) -> u64 { - const RUNNING_BIT: u64 = 1 << 63; - const MAX_GENERATION: u64 = RUNNING_BIT - 1; - self.get_running() - .fetch_update(Ordering::Release, Ordering::Acquire, |raw| { - let current_generation = raw & !RUNNING_BIT; - let running_bit = raw & RUNNING_BIT; - if current_generation == MAX_GENERATION { - // Restart generation from 0 - return Some(running_bit); - } - Some((current_generation + 1) | running_bit) - }) - .map(|raw| (raw & !RUNNING_BIT) + 1) // Return the NEW generation - .unwrap_or(1) // If wrapped, return 1 - } - - /// Get the current running state and generation counter. 
- /// - /// Returns a tuple of (running, generation) where: - /// - running: true if VCPU is currently in guest mode - /// - generation: current generation counter value - fn get_running_and_generation(&self) -> (bool, u64) { - const RUNNING_BIT: u64 = 1 << 63; - let raw = self.get_running().load(Ordering::Acquire); - let running = raw & RUNNING_BIT != 0; - let generation = raw & !RUNNING_BIT; - (running, generation) - } - - /// Clear the running bit and return the old value. - /// - /// This is called when the VCPU exits from guest mode back to host mode. - /// The return value (which includes the generation and the old running bit) - /// is currently unused by all callers. - fn clear_running_bit(&self) -> u64 { - const RUNNING_BIT: u64 = 1 << 63; - self.get_running() - .fetch_and(!RUNNING_BIT, Ordering::Release) - } -} - #[cfg(any(kvm, mshv3))] #[derive(Debug)] pub(super) struct LinuxInterruptHandle { - /// Atomic flag combining running state and generation counter. - /// - /// **Bit 63**: VCPU running state (1 = running, 0 = not running) - /// **Bits 0-62**: Generation counter (incremented once per guest function call) - /// - /// # Generation Tracking - /// - /// The generation counter is incremented once at the start of each guest function call - /// and remains constant throughout that call, even if the VCPU is run multiple times - /// (due to host function calls, retries, etc.). This design solves the race condition - /// where a kill() from a previous call could spuriously cancel a new call. - /// - /// ## Why Generations Are Needed + /// Atomic value packing vcpu execution state. /// - /// Consider this scenario WITHOUT generation tracking: - /// 1. Thread A starts guest call 1, VCPU runs - /// 2. Thread B calls kill(), sends signal to Thread A - /// 3. Guest call 1 completes before signal arrives - /// 4. Thread A starts guest call 2, VCPU runs again - /// 5. Stale signal from step 2 arrives and incorrectly cancels call 2 + /// Bit layout: + /// - Bit 2: DEBUG_INTERRUPT_BIT - set when debugger interrupt is requested + /// - Bit 1: RUNNING_BIT - set when vcpu is actively running + /// - Bit 0: CANCEL_BIT - set when cancellation has been requested /// - /// WITH generation tracking: - /// 1. Thread A starts guest call 1 (generation N), VCPU runs - /// 2. Thread B calls kill(), stamps cancel_requested with generation N - /// 3. Guest call 1 completes, signal may or may not have arrived yet - /// 4. Thread A starts guest call 2 (generation N+1), VCPU runs again - /// 5. If stale signal arrives, signal handler checks: cancel_requested.generation (N) != current generation (N+1) - /// 6. Stale signal is ignored, call 2 continues normally - /// - /// ## Per-Call vs Per-Run Generation - /// - /// It's critical that generation is incremented per GUEST FUNCTION CALL, not per vcpu.run(): - /// - A single guest function call may invoke vcpu.run() multiple times (host calls, retries) - /// - All run() calls within the same guest call must share the same generation - /// - This ensures kill() affects the entire guest function call atomically - /// - /// # Invariants - /// - /// - If VCPU is running: bit 63 is set (neither converse nor inverse holds) - /// - If VCPU is running: bits 0-62 match the current guest call's generation - running: AtomicU64, + /// CANCEL_BIT persists across vcpu exits/re-entries within a single `VirtualCPU::run()` call + /// (e.g., during host function calls), but is cleared at the start of each new `VirtualCPU::run()` call. 
+    state: AtomicU8,

-    /// Thread ID where the VCPU is currently running.
-    ///
-    /// # Invariants
+    /// Thread ID where the vcpu is running.
     ///
-    /// - If VCPU is running: tid contains the thread ID of the executing thread
-    /// - Multiple VMs may share the same tid, but at most one will have running=true
+    /// Note: Multiple VMs may have the same `tid` (same thread runs multiple sandboxes sequentially),
+    /// but at most one VM will have RUNNING_BIT set at any given time.
     tid: AtomicU64,

-    /// Generation-aware cancellation request flag.
-    ///
-    /// **Bit 63**: Cancellation requested flag (1 = kill requested, 0 = no kill)
-    /// **Bits 0-62**: Generation number when cancellation was requested
-    ///
-    /// # Purpose
-    ///
-    /// This flag serves three critical functions:
-    ///
-    /// 1. **Prevent stale signals**: A VCPU may only be interrupted if cancel_requested
-    ///    is set AND the generation matches the current call's generation
-    ///
-    /// 2. **Handle host function calls**: If kill() is called while a host function is
-    ///    executing (VCPU not running but call is active), cancel_requested is stamped
-    ///    with the current generation. When the host function returns and the VCPU
-    ///    attempts to re-enter the guest, it will see the cancellation and abort.
-    ///
-    /// 3. **Detect stale kills**: If cancel_requested.generation doesn't match the
-    ///    current generation, it's from a previous call and should be ignored
-    ///
-    /// # States and Transitions
-    ///
-    /// - **No cancellation**: cancel_requested = 0 (bit 63 clear)
-    /// - **Cancellation for generation N**: cancel_requested = (1 << 63) | N
-    /// - Signal handler checks: (cancel_requested & 0x7FFFFFFFFFFFFFFF) == current_generation
-    cancel_requested: AtomicU64,
-
-    /// Flag indicating whether a guest function call is currently in progress.
-    ///
-    /// **true**: A guest function call is active (between call start and completion)
-    /// **false**: No guest function call is active
-    ///
-    /// # Purpose
-    ///
-    /// This flag prevents kill() from having any effect when called outside of a
-    /// guest function call. This solves the "kill-in-advance" problem where kill()
-    /// could be called before a guest function starts and would incorrectly cancel it.
-    ///
-    /// # Behavior
-    ///
-    /// - Set to true at the start of call_guest_function_by_name_no_reset()
-    /// - Cleared at the end of call_guest_function_by_name_no_reset()
-    /// - kill() only stamps cancel_requested if call_active is true
-    /// - If kill() is called when call_active=false, it returns false and has no effect
-    ///
-    /// # Why AtomicBool is Safe
-    ///
-    /// Although there's a theoretical race where:
-    /// 1. Thread A checks call_active (false)
-    /// 2. Thread B sets call_active (true) and starts guest call
-    /// 3. Thread A's kill() returns false (no effect)
-    ///
-    /// This is acceptable because the generation tracking provides an additional
-    /// safety layer. Even if a stale kill somehow stamped cancel_requested, the
-    /// generation mismatch would cause it to be ignored.
-    call_active: AtomicBool,
-
-    /// Debugger interrupt request flag (GDB only).
-    ///
-    /// Set when kill_from_debugger() is called, cleared when VCPU stops running.
-    /// Used to distinguish debugger interrupts from normal kill() interrupts.
-    #[cfg(gdb)]
-    debug_interrupt: AtomicBool,
-
     /// Whether the corresponding VM has been dropped.
     dropped: AtomicBool,

-    /// Delay between retry attempts when sending signals to the VCPU thread.
+    /// Delay between retry attempts when sending signals to interrupt the vcpu.
     retry_delay: Duration,

-    /// Offset from SIGRTMIN for the signal used to interrupt the VCPU thread.
+    /// Offset from SIGRTMIN for the signal used to interrupt the vcpu thread.
     sig_rt_min_offset: u8,
 }

 #[cfg(any(kvm, mshv3))]
 impl LinuxInterruptHandle {
-    fn send_signal(&self, stamp_generation: bool) -> bool {
+    const RUNNING_BIT: u8 = 1 << 1;
+    const CANCEL_BIT: u8 = 1 << 0;
+    #[cfg(gdb)]
+    const DEBUG_INTERRUPT_BIT: u8 = 1 << 2;
+
+    /// Get the running, cancel and debug flags atomically.
+    ///
+    /// # Memory Ordering
+    /// Uses `Acquire` ordering to synchronize with the `Release` in `set_running()` and `kill()`.
+    /// This ensures that when we observe running=true, we also see the correct `tid` value.
+    fn get_running_cancel_debug(&self) -> (bool, bool, bool) {
+        let state = self.state.load(Ordering::Acquire);
+        let running = state & Self::RUNNING_BIT != 0;
+        let cancel = state & Self::CANCEL_BIT != 0;
+        #[cfg(gdb)]
+        let debug = state & Self::DEBUG_INTERRUPT_BIT != 0;
+        #[cfg(not(gdb))]
+        let debug = false;
+        (running, cancel, debug)
+    }
+
+    fn send_signal(&self) -> bool {
         let signal_number = libc::SIGRTMIN() + self.sig_rt_min_offset as libc::c_int;
         let mut sent_signal = false;
-        let mut target_generation: Option<u64> = None;
         loop {
-            if !self.call_active.load(Ordering::Acquire) {
-                // No active call, so no need to send signal
-                break;
-            }
+            let (running, cancel, debug) = self.get_running_cancel_debug();

-            let (running, generation) = self.get_running_and_generation();
-
-            // Stamp generation into cancel_requested if requested and this is the first iteration.
-            // We stamp even when running=false to support killing during host function calls.
-            // The generation tracking will prevent stale kills from affecting new calls.
-            // Only stamp if a call is actually active (call_active=true).
-            if stamp_generation
-                && target_generation.is_none()
-                && self.call_active.load(Ordering::Acquire)
-            {
-                self.set_cancel_requested(generation);
-                target_generation = Some(generation);
-            }
+            // Check whether we should keep sending signals:
+            // stop if the vcpu is not running, or if neither cancel nor debug_interrupt is set.
+            let should_continue = running && (cancel || debug);

-            // If not running, we've stamped the generation (if requested), so we're done.
-            // This handles the host function call scenario.
-            if !running {
+            if !should_continue {
                 break;
             }

-            match target_generation {
-                None => target_generation = Some(generation),
-                // prevent ABA problem
-                Some(expected) if expected != generation => break,
-                _ => {}
-            }
-
             log::info!("Sending signal to kill vcpu thread...");
             sent_signal = true;
+            // Acquire ordering to synchronize with the Release store in set_tid().
+            // This ensures we see the correct tid value for the currently running vcpu.
             unsafe {
                 libc::pthread_kill(self.tid.load(Ordering::Acquire) as _, signal_number);
             }
@@ -849,42 +660,204 @@ impl LinuxInterruptHandle {
     }
 }

+#[cfg(any(kvm, mshv3))]
+impl InterruptHandleImpl for LinuxInterruptHandle {
+    fn set_tid(&self) {
+        // Release ordering to synchronize with the Acquire load of `running` in send_signal().
+        // This ensures that when send_signal() observes RUNNING_BIT=true (via Acquire),
+        // it also sees the correct tid value stored here.
+        self.tid
+            .store(unsafe { libc::pthread_self() as u64 }, Ordering::Release);
+    }
+
+    fn set_running(&self) {
+        // Release ordering to ensure that the tid store (which uses Release)
+        // is visible to any thread that observes running=true via Acquire ordering.
+        // This prevents the interrupt thread from reading a stale tid value.
+        self.state.fetch_or(Self::RUNNING_BIT, Ordering::Release);
+    }
+
+    fn is_cancelled(&self) -> bool {
+        // Acquire ordering to synchronize with the Release in kill().
+        // This ensures we see the cancel flag set by the interrupt thread.
+        self.state.load(Ordering::Acquire) & Self::CANCEL_BIT != 0
+    }
+
+    fn clear_cancel(&self) {
+        // Release ordering to ensure that any operations from the previous run()
+        // are visible to other threads. While this is typically called by the vcpu thread
+        // at the start of a guest function call, the VM itself can move between threads across guest calls.
+        self.state.fetch_and(!Self::CANCEL_BIT, Ordering::Release);
+    }
+
+    fn clear_running(&self) {
+        // Release ordering to ensure all vcpu operations are visible before clearing running.
+        self.state.fetch_and(!Self::RUNNING_BIT, Ordering::Release);
+    }
+
+    fn is_debug_interrupted(&self) -> bool {
+        #[cfg(gdb)]
+        {
+            self.state.load(Ordering::Acquire) & Self::DEBUG_INTERRUPT_BIT != 0
+        }
+        #[cfg(not(gdb))]
+        {
+            false
+        }
+    }
+
+    #[cfg(gdb)]
+    fn clear_debug_interrupt(&self) {
+        self.state
+            .fetch_and(!Self::DEBUG_INTERRUPT_BIT, Ordering::Release);
+    }
+
+    fn set_dropped(&self) {
+        // Release ordering to ensure all VM cleanup operations are visible
+        // to any thread that checks dropped() via Acquire.
+        self.dropped.store(true, Ordering::Release);
+    }
+}
+
 #[cfg(any(kvm, mshv3))]
 impl InterruptHandle for LinuxInterruptHandle {
     fn kill(&self) -> bool {
-        if !(self.call_active.load(Ordering::Acquire)) {
-            // No active call, so no effect
-            return false;
-        }
+        // Release ordering ensures that any writes before kill() are visible to the vcpu thread
+        // when it checks is_cancelled() with Acquire ordering.
+        self.state.fetch_or(Self::CANCEL_BIT, Ordering::Release);

-        // send_signal will stamp the generation into cancel_requested
-        // right before sending each signal, ensuring they're always in sync
-        self.send_signal(true)
+        // Send signals to interrupt the vcpu if it's currently running
+        self.send_signal()
     }

     #[cfg(gdb)]
     fn kill_from_debugger(&self) -> bool {
-        self.debug_interrupt.store(true, Ordering::Relaxed);
-        self.send_signal(false)
+        self.state
+            .fetch_or(Self::DEBUG_INTERRUPT_BIT, Ordering::Release);
+        self.send_signal()
     }
-
     fn dropped(&self) -> bool {
-        self.dropped.load(Ordering::Relaxed)
+        // Acquire ordering to synchronize with the Release in set_dropped().
+        // This ensures we see all VM cleanup operations that happened before drop.
+        self.dropped.load(Ordering::Acquire)
     }
 }

-#[cfg(any(kvm, mshv3))]
-impl InterruptHandleInternal for LinuxInterruptHandle {
-    fn get_call_active(&self) -> &AtomicBool {
-        &self.call_active
+#[cfg(target_os = "windows")]
+#[derive(Debug)]
+pub(super) struct WindowsInterruptHandle {
+    /// Atomic value packing vcpu execution state.
+    ///
+    /// Bit layout:
+    /// - Bit 2: DEBUG_INTERRUPT_BIT - set when debugger interrupt is requested
+    /// - Bit 1: RUNNING_BIT - set when vcpu is actively running
+    /// - Bit 0: CANCEL_BIT - set when cancellation has been requested
+    ///
+    /// `WHvCancelRunVirtualProcessor()` will return Ok even if the vcpu is not running,
+    /// which is why we need the RUNNING_BIT.
+    ///
+    /// CANCEL_BIT persists across vcpu exits/re-entries within a single `VirtualCPU::run()` call
+    /// (e.g., during host function calls), but is cleared at the start of each new guest function
+    /// call (in `MultiUseSandbox::call`, before `VirtualCPU::run()` is entered).
+    state: AtomicU8,
+
+    partition_handle: windows::Win32::System::Hypervisor::WHV_PARTITION_HANDLE,
+    dropped: AtomicBool,
+}
+
+#[cfg(target_os = "windows")]
+impl WindowsInterruptHandle {
+    const RUNNING_BIT: u8 = 1 << 1;
+    const CANCEL_BIT: u8 = 1 << 0;
+    #[cfg(gdb)]
+    const DEBUG_INTERRUPT_BIT: u8 = 1 << 2;
+}
+
+#[cfg(target_os = "windows")]
+impl InterruptHandleImpl for WindowsInterruptHandle {
+    fn set_running(&self) {
+        // Release ordering to ensure prior memory operations are visible when another thread observes running=true.
+        self.state.fetch_or(Self::RUNNING_BIT, Ordering::Release);
+    }
+
+    fn is_cancelled(&self) -> bool {
+        // Acquire ordering to synchronize with the Release in kill().
+        // This ensures we see the CANCEL_BIT set by the interrupt thread.
+        self.state.load(Ordering::Acquire) & Self::CANCEL_BIT != 0
+    }
+
+    fn clear_cancel(&self) {
+        // Release ordering to ensure that any operations from the previous run()
+        // are visible to other threads. While this is typically called by the vcpu thread
+        // at the start of a guest function call, the VM itself can move between threads across guest calls.
+        self.state.fetch_and(!Self::CANCEL_BIT, Ordering::Release);
+    }
+
+    fn clear_running(&self) {
+        // Release ordering to ensure all vcpu operations are visible before clearing running.
+        self.state.fetch_and(!Self::RUNNING_BIT, Ordering::Release);
+    }
+
+    fn is_debug_interrupted(&self) -> bool {
+        #[cfg(gdb)]
+        {
+            self.state.load(Ordering::Acquire) & Self::DEBUG_INTERRUPT_BIT != 0
+        }
+        #[cfg(not(gdb))]
+        {
+            false
+        }
     }

-    fn get_running(&self) -> &AtomicU64 {
-        &self.running
+    #[cfg(gdb)]
+    fn clear_debug_interrupt(&self) {
+        self.state
+            .fetch_and(!Self::DEBUG_INTERRUPT_BIT, Ordering::Release);
+    }
+
+    fn set_dropped(&self) {
+        // Release ordering to ensure all VM cleanup operations are visible
+        // to any thread that checks dropped() via Acquire.
+        self.dropped.store(true, Ordering::Release);
+    }
+}
+
+#[cfg(target_os = "windows")]
+impl InterruptHandle for WindowsInterruptHandle {
+    fn kill(&self) -> bool {
+        use windows::Win32::System::Hypervisor::WHvCancelRunVirtualProcessor;
+
+        // Release ordering ensures that any writes before kill() are visible to the vcpu thread
+        // when it checks is_cancelled() with Acquire ordering.
+        self.state.fetch_or(Self::CANCEL_BIT, Ordering::Release);
+
+        // Acquire ordering to synchronize with the Release in set_running().
+        // This ensures we see the running state set by the vcpu thread.
+        let state = self.state.load(Ordering::Acquire);
+        if state & Self::RUNNING_BIT != 0 {
+            unsafe { WHvCancelRunVirtualProcessor(self.partition_handle, 0, 0).is_ok() }
+        } else {
+            false
+        }
+    }
+
+    #[cfg(gdb)]
+    fn kill_from_debugger(&self) -> bool {
+        use windows::Win32::System::Hypervisor::WHvCancelRunVirtualProcessor;
+
+        self.state
+            .fetch_or(Self::DEBUG_INTERRUPT_BIT, Ordering::Release);
+        // Acquire ordering to synchronize with the Release in set_running()
+        let state = self.state.load(Ordering::Acquire);
+        if state & Self::RUNNING_BIT != 0 {
+            unsafe { WHvCancelRunVirtualProcessor(self.partition_handle, 0, 0).is_ok() }
+        } else {
+            false
+        }
     }

-    fn get_cancel_requested(&self) -> &AtomicU64 {
-        &self.cancel_requested
+    fn dropped(&self) -> bool {
+        // Acquire ordering to synchronize with the Release in set_dropped().
+        // This ensures we see all VM cleanup operations that happened before drop.
+        self.dropped.load(Ordering::Acquire)
     }
 }
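The Release/Acquire pairing that both handles rely on can be distilled into a few lines. The sketch below is illustrative only (a hypothetical `Handle` type with plain integers standing in for pthread identifiers); it shows why a killer thread that observes RUNNING_BIT with Acquire is guaranteed to read the tid that was published before the matching Release:

```rust
use std::sync::atomic::{AtomicU8, AtomicU64, Ordering};

const CANCEL_BIT: u8 = 1 << 0;
const RUNNING_BIT: u8 = 1 << 1;

// Hypothetical stand-in for the interrupt handles above, for illustration only.
struct Handle {
    state: AtomicU8,
    tid: AtomicU64,
}

impl Handle {
    // vcpu thread: publish tid first, then set RUNNING_BIT with Release,
    // so any thread that observes RUNNING_BIT via Acquire also sees this tid.
    fn enter_guest(&self, my_tid: u64) {
        self.tid.store(my_tid, Ordering::Release);
        self.state.fetch_or(RUNNING_BIT, Ordering::Release);
    }

    // killer thread: request cancellation, then decide whether a signal is needed.
    // The Acquire load of `state` synchronizes with enter_guest's Release fetch_or,
    // so the tid loaded below cannot be stale for the run that was observed.
    fn kill(&self) -> Option<u64> {
        self.state.fetch_or(CANCEL_BIT, Ordering::Release);
        if self.state.load(Ordering::Acquire) & RUNNING_BIT != 0 {
            Some(self.tid.load(Ordering::Acquire)) // would be passed to pthread_kill
        } else {
            None // not running: the CANCEL_BIT alone suffices
        }
    }
}
```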
diff --git a/src/hyperlight_host/src/sandbox/initialized_multi_use.rs b/src/hyperlight_host/src/sandbox/initialized_multi_use.rs
index 2f83b1f5c..d66779c7e 100644
--- a/src/hyperlight_host/src/sandbox/initialized_multi_use.rs
+++ b/src/hyperlight_host/src/sandbox/initialized_multi_use.rs
@@ -47,47 +47,11 @@ use crate::mem::shared_mem::HostSharedMemory;
 use crate::metrics::{
     METRIC_GUEST_ERROR, METRIC_GUEST_ERROR_LABEL_CODE, maybe_time_and_emit_guest_call,
 };
-use crate::{Result, log_then_return, new_error};
+use crate::{Result, log_then_return};

 /// Global counter for assigning unique IDs to sandboxes
 static SANDBOX_ID_COUNTER: AtomicU64 = AtomicU64::new(0);

-/// RAII guard that automatically calls `clear_call_active()` when dropped.
-///
-/// This ensures that the call_active flag is always cleared when a guest function
-/// call completes, even if the function returns early due to an error.
-///
-/// Only one guard can exist per interrupt handle at a time - attempting to create
-/// a second guard will return an error.
-struct CallActiveGuard {
-    interrupt_handle: Arc<dyn InterruptHandleInternal>,
-}
-
-impl CallActiveGuard {
-    /// Creates a new guard and marks a guest function call as active.
-    ///
-    /// # Errors
-    ///
-    /// Returns an error if `call_active` is already true (i.e., another guard already exists).
-    fn new(interrupt_handle: Arc<dyn InterruptHandleInternal>) -> Result<Self> {
-        // Atomically check that call_active is false and set it to true.
-        // This prevents creating multiple guards for the same interrupt handle.
-        let was_active = interrupt_handle.set_call_active();
-        if was_active {
-            return Err(new_error!(
-                "Attempted to create CallActiveGuard when a call is already active"
-            ));
-        }
-        Ok(Self { interrupt_handle })
-    }
-}
-
-impl Drop for CallActiveGuard {
-    fn drop(&mut self) {
-        self.interrupt_handle.clear_call_active();
-    }
-}
-
 /// A fully initialized sandbox that can execute guest functions multiple times.
 ///
 /// Guest functions can be called repeatedly while maintaining state between calls.
@@ -611,10 +575,10 @@ impl MultiUseSandbox {
         if self.poisoned {
             return Err(crate::HyperlightError::PoisonedSandbox);
         }
-        // Mark that a guest function call is now active
-        // (This also increments the generation counter internally)
-        // The guard will automatically clear call_active when dropped
-        let _guard = CallActiveGuard::new(self.vm.interrupt_handle())?;
+        // ===== KILL() TIMING POINT 1 =====
+        // Clear any stale cancellation from a previous guest function call, or from a kill() that arrived too early.
+        // Any kill() that completed (even partially) BEFORE this line has NO effect on this call.
+        self.vm.clear_cancel();

         let res = (|| {
             let estimated_capacity = estimate_flatbuffer_capacity(function_name, &args);
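From the caller's perspective, the timing-point-1 semantics boil down to: a `kill()` that completes before the next call starts is a no-op, while a `kill()` racing with an in-flight call may cancel it. A hedged usage sketch, borrowing the `new_uninit_rust()` helper and the "Echo" guest function from the tests below (neither is a fixed public API):

```rust
use std::thread;

use hyperlight_host::HyperlightError;

// Sketch only: demonstrates the observable contract of clear_cancel()/kill(),
// assuming the test helpers `new_uninit_rust()` and the "Echo" guest function.
fn kill_racing_with_a_call() -> hyperlight_host::Result<()> {
    let mut sandbox = new_uninit_rust()?.evolve()?;

    // A kill() that completes here, before the call, is erased at timing point 1.
    sandbox.interrupt_handle().kill();

    // A kill() issued while the call is in flight may cancel it.
    let handle = sandbox.interrupt_handle();
    let killer = thread::spawn(move || handle.kill());

    match sandbox.call::<String>("Echo", "hello".to_string()) {
        Ok(_) => {}                                           // kill arrived too late
        Err(HyperlightError::ExecutionCanceledByHost()) => {} // kill landed mid-call
        Err(e) => return Err(e),                              // anything else is a real failure
    }
    killer.join().unwrap();
    Ok(())
}
```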
diff --git a/src/hyperlight_host/tests/integration_test.rs b/src/hyperlight_host/tests/integration_test.rs
index 0d8aa271b..fc735974b 100644
--- a/src/hyperlight_host/tests/integration_test.rs
+++ b/src/hyperlight_host/tests/integration_test.rs
@@ -291,7 +291,7 @@ fn interrupt_moved_sandbox() {
 /// This will exercise the ABA-problem, where the vcpu could be successfully interrupted,
 /// but restarted, before the interruptor-thread has a chance to see that the vcpu was killed.
 ///
-/// The ABA-problem is solved by introducing run-generation on the vcpu.
+/// The ABA-problem is solved by clearing the CANCEL_BIT at the start of each guest function call.
 #[test]
 #[cfg(target_os = "linux")]
 fn interrupt_custom_signal_no_and_retry_delay() {
@@ -1346,3 +1346,279 @@ fn interrupt_random_kill_stress_test() {
     println!("\n✅ All validations passed!");
 }
+
+/// Ensures that `kill()` reliably interrupts a running guest.
+///
+/// The test works by:
+/// 1. The guest calls a host function that waits on a barrier, ensuring the guest call is
+///    genuinely in progress and that `kill()` is not issued so early that it would be cleared and ignored.
+/// 2. Once the guest has passed that host-function barrier, the host calls `kill()`. The signal can be
+///    delivered at any time after this point, for example while the guest is still inside the host
+///    function, or while it is returning into the guest VM.
+/// 3. The guest then enters an infinite loop, so `kill()` is the only way to stop it.
+///
+/// This is repeated across multiple threads and iterations to stress test the cancellation mechanism.
+///
+/// **Failure condition:** if this test hangs, `kill()` failed to stop the guest, leaving it spinning forever.
+#[test]
+fn interrupt_infinite_loop_stress_test() {
+    use std::sync::{Arc, Barrier};
+    use std::thread;
+
+    const NUM_THREADS: usize = 50;
+    const ITERATIONS_PER_THREAD: usize = 500;
+
+    let mut handles = vec![];
+
+    for i in 0..NUM_THREADS {
+        handles.push(thread::spawn(move || {
+            // Create a barrier for 2 threads:
+            // 1. The guest (executing a host function)
+            // 2. The killer thread
+            let barrier = Arc::new(Barrier::new(2));
+            let barrier_for_host = barrier.clone();
+
+            let mut uninit = new_uninit_rust().unwrap();
+
+            // Register a host function that waits on the barrier
+            uninit
+                .register("WaitForKill", move || {
+                    barrier_for_host.wait();
+                    Ok(())
+                })
+                .unwrap();
+
+            let mut sandbox = uninit.evolve().unwrap();
+            // Take a snapshot to restore after each kill
+            let snapshot = sandbox.snapshot().unwrap();
+
+            for j in 0..ITERATIONS_PER_THREAD {
+                let barrier_for_killer = barrier.clone();
+                let interrupt_handle = sandbox.interrupt_handle();
+
+                // Spawn the killer thread
+                let killer_thread = std::thread::spawn(move || {
+                    // Wait for the guest to call WaitForKill
+                    barrier_for_killer.wait();
+
+                    // The guest is now waiting on the barrier (or just finished waiting).
+                    // We kill it immediately.
+                    interrupt_handle.kill();
+                });
+
+                // Call the guest function "CallHostThenSpin", which calls "WaitForKill" once and then spins.
+                // NOTE: If this test hangs, the guest was not successfully killed and is spinning forever,
+                // which indicates a bug in the cancellation mechanism.
+                let res = sandbox.call::<()>("CallHostThenSpin", "WaitForKill".to_string());
+
+                // Wait for killer thread to finish
+                killer_thread.join().unwrap();
+
+                // We expect the execution to be canceled
+                match res {
+                    Err(HyperlightError::ExecutionCanceledByHost()) => {
+                        // Success!
+                    }
+                    Ok(_) => {
+                        panic!(
+                            "Thread {} Iteration {}: Guest finished successfully but should have been killed!",
+                            i, j
+                        );
+                    }
+                    Err(e) => {
+                        panic!(
+                            "Thread {} Iteration {}: Guest failed with unexpected error: {:?}",
+                            i, j, e
+                        );
+                    }
+                }
+
+                // Restore the sandbox for the next iteration
+                sandbox.restore(&snapshot).unwrap();
+            }
+        }));
+    }
+
+    for handle in handles {
+        handle.join().unwrap();
+    }
+}
+
+/// Validates that kill delivery stays accurate when a sandbox hops between threads
+/// mid-call and shares a thread ID with another sandbox, ensuring only the intended
+/// VM is interrupted while bait sandboxes keep running.
+#[test]
+fn interrupt_infinite_moving_loop_stress_test() {
+    use std::sync::Arc;
+    use std::thread;
+
+    // We use a high thread count to stress test and to produce interesting interleavings
+    const NUM_THREADS: usize = 200;
+
+    let mut handles = vec![];
+
+    for _ in 0..NUM_THREADS {
+        handles.push(thread::spawn(move || {
+            let entered_guest = Arc::new(AtomicBool::new(false));
+            let entered_guest_clone = entered_guest.clone();
+
+            let mut uninit = new_uninit_rust().unwrap();
+            // Register a host function that records that the guest entered it
+            uninit
+                .register("WaitForKill", move || {
+                    entered_guest.store(true, Ordering::Relaxed);
+                    Ok(())
+                })
+                .unwrap();
+            let uninit2 = new_uninit_rust().unwrap();
+
+            // These 2 sandboxes will have the same TID
+            let sandbox = uninit.evolve().unwrap();
+            let bait = uninit2.evolve().unwrap();
+
+            let interrupt = sandbox.interrupt_handle();
+
+            let kill_handle = thread::spawn(move || {
+                let entered_before_kill = entered_guest_clone.load(Ordering::Relaxed);
+                interrupt.kill();
+                entered_before_kill
+            });
+
+            let mut sandbox_slot = Some(sandbox);
+            let mut bait_slot = Some(bait);
+
+            // The bait sandbox should NEVER be interrupted, which is why we can unwrap().
+            // The real sandbox may or may not be interrupted, depending on whether `kill()` was called prematurely.
+            let sandbox_res = match rand::random_range(0..8u8) {
+                // sandbox on main thread, then bait on main thread
+                0 => {
+                    let res = sandbox_slot
+                        .as_mut()
+                        .unwrap()
+                        .call::<String>("Echo", "Real".to_string());
+                    bait_slot
+                        .as_mut()
+                        .unwrap()
+                        .call::<String>("Echo", "Bait".to_string())
+                        .expect("Bait call should never be interrupted");
+                    res
+                }
+                // bait on main thread, then sandbox on main thread
+                1 => {
+                    bait_slot
+                        .as_mut()
+                        .unwrap()
+                        .call::<String>("Echo", "Bait".to_string())
+                        .expect("Bait call should never be interrupted");
+                    sandbox_slot
+                        .as_mut()
+                        .unwrap()
+                        .call::<String>("Echo", "Real".to_string())
+                }
+                // sandbox on spawned thread, bait on main thread
+                2 => {
+                    let mut sandbox = sandbox_slot.take().unwrap();
+                    let sandbox_handle =
+                        thread::spawn(move || sandbox.call::<String>("Echo", "Real".to_string()));
+                    bait_slot
+                        .as_mut()
+                        .unwrap()
+                        .call::<String>("Echo", "Bait".to_string())
+                        .expect("Bait call should never be interrupted");
+                    sandbox_handle.join().unwrap()
+                }
+                // bait on spawned thread, sandbox on main thread
+                3 => {
+                    let mut bait = bait_slot.take().unwrap();
+                    let bait_handle = thread::spawn(move || {
+                        bait.call::<String>("Echo", "Bait".to_string())
+                            .expect("Bait call should never be interrupted");
+                    });
+                    let res = sandbox_slot
+                        .as_mut()
+                        .unwrap()
+                        .call::<String>("Echo", "Real".to_string());
+                    bait_handle.join().unwrap();
+                    res
+                }
+                // sandbox on main thread, bait on spawned thread
+                4 => {
+                    let res = sandbox_slot
+                        .as_mut()
+                        .unwrap()
+                        .call::<String>("Echo", "Real".to_string());
+                    let mut bait = bait_slot.take().unwrap();
+                    let bait_handle = thread::spawn(move || {
+                        bait.call::<String>("Echo", "Bait".to_string())
+                            .expect("Bait call should never be interrupted");
+                    });
+                    bait_handle.join().unwrap();
+                    res
+                }
+                // bait on main thread, sandbox on spawned thread
+                5 => {
+                    bait_slot
+                        .as_mut()
+                        .unwrap()
+                        .call::<String>("Echo", "Bait".to_string())
+                        .expect("Bait call should never be interrupted");
+                    let mut sandbox = sandbox_slot.take().unwrap();
+                    let sandbox_handle =
+                        thread::spawn(move || sandbox.call::<String>("Echo", "Real".to_string()));
+                    sandbox_handle.join().unwrap()
+                }
+                // sandbox on spawned thread, bait on spawned thread
+                6 => {
+                    let mut sandbox = sandbox_slot.take().unwrap();
+                    let sandbox_handle =
+                        thread::spawn(move || sandbox.call::<String>("Echo", "Real".to_string()));
+                    let mut bait = bait_slot.take().unwrap();
+                    let bait_handle = thread::spawn(move || {
+                        bait.call::<String>("Echo", "Bait".to_string())
+                            .expect("Bait call should never be interrupted");
+                    });
+                    bait_handle.join().unwrap();
+                    sandbox_handle.join().unwrap()
+                }
+                // bait on spawned thread, sandbox on spawned thread (spawn bait first)
+                7 => {
+                    let mut bait = bait_slot.take().unwrap();
+                    let bait_handle = thread::spawn(move || {
+                        bait.call::<String>("Echo", "Bait".to_string())
+                            .expect("Bait call should never be interrupted");
+                    });
+                    let mut sandbox = sandbox_slot.take().unwrap();
+                    let sandbox_handle =
+                        thread::spawn(move || sandbox.call::<String>("Echo", "Real".to_string()));
+                    bait_handle.join().unwrap();
+                    sandbox_handle.join().unwrap()
+                }
+                _ => unreachable!(),
+            };
+
+            let entered_before_kill = kill_handle.join().unwrap();
+
+            // If the guest entered the host function before kill() was issued, the kill was NOT
+            // premature, so we know for a fact the call must have been canceled.
+            if entered_before_kill {
+                assert!(matches!(
+                    sandbox_res,
+                    Err(HyperlightError::ExecutionCanceledByHost())
+                ));
+            }
+
+            // Otherwise the call may or may not have been canceled, depending on timing.
+            match sandbox_res {
+                Err(HyperlightError::ExecutionCanceledByHost()) => {
+                    // OK!
+                }
+                Ok(_) => {
+                    // OK!
+                }
+                Err(e) => {
+                    panic!("Got unexpected error: {:?}", e);
+                }
+            }
+        }));
+    }
+
+    for handle in handles {
+        handle.join().unwrap();
+    }
+}
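These Linux tests also depend on one piece of setup this diff does not touch: a handler must be installed for `SIGRTMIN + offset` whose only job is to knock the blocking vcpu `ioctl` out with `EINTR`. A minimal sketch of such a registration (illustrative only; this is not Hyperlight's actual setup code) could look like:

```rust
// Illustrative only: Hyperlight installs its own handler; this just shows the shape.
unsafe fn install_noop_vcpu_signal_handler(offset: i32) {
    // The handler body is deliberately empty: delivery alone makes the
    // blocking KVM/MSHV ioctl return EINTR, forcing a vcpu exit.
    extern "C" fn noop(_sig: libc::c_int) {}

    let mut sa: libc::sigaction = std::mem::zeroed();
    sa.sa_sigaction = noop as libc::sighandler_t;
    // No SA_RESTART: the interrupted ioctl must fail with EINTR rather than resume.
    sa.sa_flags = 0;
    libc::sigemptyset(&mut sa.sa_mask);
    libc::sigaction(libc::SIGRTMIN() + offset, &sa, std::ptr::null_mut());
}
```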
diff --git a/src/tests/rust_guests/simpleguest/src/main.rs b/src/tests/rust_guests/simpleguest/src/main.rs
index e2a632a9d..9e0a2a08b 100644
--- a/src/tests/rust_guests/simpleguest/src/main.rs
+++ b/src/tests/rust_guests/simpleguest/src/main.rs
@@ -1097,6 +1097,14 @@ pub extern "C" fn hyperlight_main() {
     );
     register_function(host_call_loop_def);

+    let call_host_then_spin_def = GuestFunctionDefinition::new(
+        "CallHostThenSpin".to_string(),
+        Vec::from(&[ParameterType::String]),
+        ReturnType::Void,
+        call_host_then_spin as usize,
+    );
+    register_function(call_host_then_spin_def);
+
     let print_using_printf_def = GuestFunctionDefinition::new(
         "PrintUsingPrintf".to_string(),
         Vec::from(&[ParameterType::String]),
@@ -1639,6 +1647,23 @@ fn host_call_loop(function_call: &FunctionCall) -> Result<Vec<u8>> {
     }
 }

+// Calls the given host function (no parameters, no return value) and then spins indefinitely.
+fn call_host_then_spin(function_call: &FunctionCall) -> Result<Vec<u8>> {
+    if let ParameterValue::String(host_func_name) = &function_call.parameters.as_ref().unwrap()[0] {
+        call_host_function::<()>(host_func_name, None, ReturnType::Void)?;
+        #[expect(
+            clippy::empty_loop,
+            reason = "This function is used to keep the CPU busy"
+        )]
+        loop {}
+    } else {
+        Err(HyperlightGuestError::new(
+            ErrorCode::GuestFunctionParameterTypeMismatch,
+            "Invalid parameters passed to call_host_then_spin".to_string(),
+        ))
+    }
+}
+
 // Interprets the given guest function call as a host function call and dispatches it to the host.
 fn fuzz_host_function(func: FunctionCall) -> Result<Vec<u8>> {
     let mut params = func.parameters.unwrap();
diff --git a/src/tests/rust_guests/witguest/Cargo.lock b/src/tests/rust_guests/witguest/Cargo.lock
index 4a0de8cf2..d566e8a65 100644
--- a/src/tests/rust_guests/witguest/Cargo.lock
+++ b/src/tests/rust_guests/witguest/Cargo.lock
@@ -378,9 +378,9 @@ dependencies = [

 [[package]]
 name = "quote"
-version = "1.0.41"
+version = "1.0.42"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "ce25767e7b499d1b604768e7cde645d14cc8584231ea6b295e9c9eb22c02e1d1"
+checksum = "a338cc41d27e6cc6dce6cefc13a0729dfbb81c262b1f519331575dd80ef3067f"
 dependencies = [
  "proc-macro2",
 ]
@@ -499,9 +499,9 @@ dependencies = [

 [[package]]
 name = "syn"
-version = "2.0.108"
+version = "2.0.110"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "da58917d35242480a05c2897064da0a80589a2a0476c9a3f2fdc83b53502e917"
+checksum = "a99801b5bd34ede4cf3fc688c5919368fea4e4814a4664359503e6015b280aea"
 dependencies = [
  "proc-macro2",
  "quote",
@@ -550,9 +550,9 @@ checksum = "06abde3611657adf66d383f00b093d7faecc7fa57071cce2578660c9f1010821"

 [[package]]
 name = "wasmparser"
-version = "0.240.0"
+version = "0.241.2"
 source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "b722dcf61e0ea47440b53ff83ccb5df8efec57a69d150e4f24882e4eba7e24a4"
+checksum = "46d90019b1afd4b808c263e428de644f3003691f243387d30d673211ee0cb8e8"
 dependencies = [
  "bitflags",
  "hashbrown",
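A closing note on the guest side: `call_host_then_spin` pins the vcpu with an empty loop, which is exactly what the stress tests need. For guests that want a busy-wait that is friendlier to hyperthread siblings, a hypothetical variant (not part of this change) could use `core::hint::spin_loop()`:

```rust
// Hypothetical alternative to the empty `loop {}` above, not part of this diff:
// spin_loop() emits a pause-style hint on x86 while keeping the vcpu fully busy,
// so kill() remains the only way to stop the guest.
fn spin_forever() -> ! {
    loop {
        core::hint::spin_loop();
    }
}
```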