Enable guest time #1422
base: main
Changes from all commits
63b7b52
65c1074
49d073d
432b6b3
420dca4
c3ec7db
13abd7d
3c14042
1b01ea1
fca0261
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,135 @@ | ||
| # Paravirtualized Guest Clock | ||
|
|
||
| Hyperlight's `enable_guest_clock` Cargo feature gives guests a cheap way to ask | ||
| "what time is it?" without taking a VM exit. When the host is built with the | ||
| feature, every sandbox exposes a paravirtualized clock that the guest can read | ||
| using ordinary memory loads. | ||
|
|
||
| ## What the guest gets | ||
|
|
||
| When the feature is enabled the host populates a single 4 KiB "clock page" | ||
| inside the sandbox's scratch region. The page carries two pieces of | ||
| information: | ||
|
|
||
| - **A hypervisor-specific calibration block at offset `0x00`.** Written by | ||
| KVM (`kvm_clock`) or Hyper-V / MSHV (Reference TSC). Contains the TSC | ||
| frequency, scaling constants, and a sequence lock the guest uses to read it | ||
| atomically. The entire clock page is hypervisor-owned; Hyperlight does not | ||
| write to it. | ||
| - **Hyperlight metadata in the scratch bookkeeping page** (separate from the | ||
| clock page): a `u64` [`ClockType`](../src/hyperlight_common/src/time.rs) tag | ||
| and `boot_time_ns`, the Unix-epoch origin of the monotonic clock computed | ||
| by the host as `wall_now - monotonic_now` (see below). These live at fixed | ||
| offsets from the top of scratch (`-0x28` and `-0x30`), NOT in the clock | ||
| page, so a future TLFS extension cannot clobber them. | ||
|
|
||
| With those two pieces the guest can compute: | ||
|
|
||
| - **Monotonic nanoseconds since boot** — read the TSC, apply the scaling | ||
| factors from the calibration block, giving you a `CLOCK_MONOTONIC` | ||
| equivalent. | ||
| - **Wall-clock nanoseconds since the Unix epoch** — add `boot_time_ns` to the | ||
| monotonic value above, giving you a `CLOCK_REALTIME` / `gettimeofday` | ||
| equivalent. `boot_time_ns` is computed by the host as | ||
| `SystemTime::now() - KVM_GET_CLOCK` (on KVM) or | ||
| `SystemTime::now() - TIME_REF_COUNT` (on Hyper-V) after sandbox | ||
| initialisation. Hyper-V has no equivalent to KVM's | ||
| `MSR_KVM_WALL_CLOCK_NEW`, so we use this uniform host-computed approach | ||
| on all backends. | ||
|
|
||
| > **Note (KVM only):** Wall-clock time returns `None` during | ||
| > `hyperlight_main` (guest init). On KVM, `KVM_GET_CLOCK` is unreliable | ||
| > until the "master clock" is established at first vCPU entry, so | ||
| > `boot_time_ns` is stamped after init completes. Monotonic time works | ||
| > fine during init. Wall-clock time becomes available on the first | ||
| > dispatch call. | ||
|
|
||
| Both reads are lock-free (well, seqlock-protected for the calibration block) | ||
| and never leave the guest. | ||
|
|
||
| ## Using it in a Rust guest | ||
|
|
||
| The guest-side API lives in `hyperlight_guest::time` for the low-level | ||
| readers and `hyperlight_guest_bin::time` for a `std::time`-flavoured | ||
| wrapper: | ||
|
|
||
| ```rust | ||
| // Low-level, no_std readers. | ||
| use hyperlight_guest::time; | ||
|
|
||
| if time::is_available() { | ||
| let mono_ns: u64 = time::monotonic_time_ns().unwrap(); | ||
| let wall_ns: u64 = time::wall_clock_time_ns().unwrap(); | ||
| } | ||
|
|
||
| // std::time-flavoured wrapper (hyperlight_guest_bin only). | ||
| use hyperlight_guest_bin::time::{Instant, SystemTime, UNIX_EPOCH}; | ||
|
|
||
| let t0 = Instant::now()?; | ||
| // ... do work ... | ||
| let elapsed = t0.elapsed()?; | ||
|
|
||
| let now = SystemTime::now()?; | ||
| let unix_ns = now.duration_since(UNIX_EPOCH)?.as_nanos(); | ||
| ``` | ||
|
|
||
| C guests that use picolibc get paravirt time for free: `hyperlight_guest_bin` | ||
| wires `clock_gettime(CLOCK_MONOTONIC|CLOCK_REALTIME)` and `gettimeofday` into | ||
| the same reader, so existing C code continues to work unchanged. | ||
|
|
||
| ## Snapshot / restore semantics | ||
|
|
||
| Both `boot_time_ns` and the hypervisor calibration block live inside scratch | ||
| memory, which is not included in snapshots. On every | ||
| `MultiUseSandbox::restore`, the host re-arms the clock page: it re-installs | ||
| the pvclock MSR / Hyper-V register against the fresh vCPU state and stamps a | ||
| new `boot_time_ns` captured at the moment of restore. As a result a restored | ||
| guest observes wall-clock time reflecting the restore moment, not the | ||
| original boot — which is what wall clocks are supposed to do. | ||
|
|
||
| ## Enabling the feature | ||
|
|
||
| Turn it on in the host's `Cargo.toml`: | ||
|
|
||
| ```toml | ||
| [dependencies] | ||
| hyperlight-host = { version = "...", features = ["enable_guest_clock"] } | ||
| ``` | ||
|
|
||
| The feature is x86_64 only; on aarch64 it has no effect. It is off by default | ||
| so existing sandboxes don't pay for a facility they don't use. When off, the | ||
| clock page is still reserved in the layout (so memory maps are stable) but | ||
| left unregistered with any hypervisor clock source; `hyperlight_guest::time` | ||
| readers then report "unavailable" and the caller decides how to fall back | ||
| (the picolibc wiring returns a synthetic 1-second-per-call | ||
| counter, which is enough to stop `strftime` crashing and not much else). | ||
|
|
||
| ## Layout details | ||
|
|
||
| The clock page sits 3 pages below the very top of the scratch region: | ||
|
|
||
| | Offset from top | Size | Contents | | ||
| |-----------------|-------|------------------------------------------------| | ||
| | `-0x1000` | 4 KiB | Bookkeeping (size, allocator counter, ...) | | ||
| | `-0x2000` | 4 KiB | Reserved for shared-state counter | | ||
| | `-0x3000` | 4 KiB | Paravirtualized clock page | | ||
|
|
||
| Because the clock page sits in the top three pages of scratch, both the guest's main stack | ||
| and its IST1 (exception) stack are configured to start at the base of the | ||
| clock page (at `MAX_GVA + 1 - SCRATCH_TOP_CLOCK_PAGE_OFFSET`) so stack writes | ||
| — including page-fault handlers running on IST1 — cannot clobber the trailer. | ||
| The allocator reserves the top three pages unconditionally so the memory map | ||
| stays identical whether or not the feature is enabled. | ||
|
|
||
| ## Non-goals | ||
|
|
||
| - **Sub-microsecond accuracy.** `boot_time_ns` is computed from two | ||
| back-to-back host reads (`SystemTime::now()` and `KVM_GET_CLOCK` / | ||
| `TIME_REF_COUNT`). On KVM, residual disagreement between `KVM_GET_CLOCK` | ||
| and the pvclock page can add up to ~13 ms of constant offset (observed on | ||
| WSL2; root cause uncertain). On Hyper-V the offset should be negligible. | ||
| - **`CLOCK_PROCESS_CPUTIME_ID` and friends.** The clock page exposes only | ||
| monotonic and wall-clock time; per-thread / per-process CPU time is out of | ||
| scope. | ||
| - **Timers or sleeps.** The guest can read the clock but has no way to ask | ||
| the hypervisor to wake it up later — that is still done through the | ||
| existing guest-function call model. | ||
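The read path described in the document above (seqlock-protected calibration block, TSC scaling, then `boot_time_ns` added for wall-clock time) can be sketched for the KVM case as follows. This is a minimal illustration, not the code in this PR: the struct mirrors the Linux `pvclock_vcpu_time_info` layout, the function names `read_consistent`, `monotonic_ns`, and `wall_clock_ns` are hypothetical, and the TSC value is passed in as a parameter rather than read with `rdtsc`. The Hyper-V Reference TSC page uses a different layout and is not covered here.

```rust
use core::sync::atomic::{fence, Ordering};

/// Mirrors the Linux/KVM `pvclock_vcpu_time_info` layout (32 bytes).
#[repr(C)]
#[derive(Clone, Copy)]
pub struct PvclockInfo {
    pub version: u32, // seqlock: odd while the hypervisor is updating
    pub pad0: u32,
    pub tsc_timestamp: u64, // TSC value at the last update
    pub system_time: u64,   // nanoseconds at the last update
    pub tsc_to_system_mul: u32,
    pub tsc_shift: i8,
    pub flags: u8,
    pub pad: [u8; 2],
}

/// Copy the calibration block, retrying while an update is in flight.
///
/// # Safety
/// `page` must point to a readable, properly aligned `PvclockInfo`.
pub unsafe fn read_consistent(page: *const PvclockInfo) -> PvclockInfo {
    loop {
        let v1 = unsafe { core::ptr::read_volatile(core::ptr::addr_of!((*page).version)) };
        fence(Ordering::Acquire);
        let snap = unsafe { core::ptr::read_volatile(page) };
        fence(Ordering::Acquire);
        let v2 = unsafe { core::ptr::read_volatile(core::ptr::addr_of!((*page).version)) };
        // Accept only an even version that did not change during the copy.
        if v1 == v2 && v1 & 1 == 0 {
            return snap;
        }
        core::hint::spin_loop();
    }
}

/// Scale a raw TSC reading to monotonic nanoseconds (pvclock formula).
pub fn monotonic_ns(info: &PvclockInfo, tsc: u64) -> u64 {
    let delta = tsc.wrapping_sub(info.tsc_timestamp);
    let shifted = if info.tsc_shift >= 0 {
        delta << info.tsc_shift as u32
    } else {
        delta >> (-(info.tsc_shift as i32)) as u32
    };
    // 32.32 fixed-point multiply, done in 128 bits to avoid overflow.
    let scaled = ((shifted as u128 * info.tsc_to_system_mul as u128) >> 32) as u64;
    info.system_time.wrapping_add(scaled)
}

/// Wall-clock nanoseconds since the Unix epoch, per the document's definition.
pub fn wall_clock_ns(boot_time_ns: u64, mono_ns: u64) -> u64 {
    boot_time_ns.wrapping_add(mono_ns)
}
```

Passing the TSC in as an argument keeps the arithmetic testable on any host; a real guest would read it with `rdtsc` between the two version checks.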
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -38,7 +38,31 @@ pub const SCRATCH_TOP_SIZE_OFFSET: u64 = 0x08; | |
| pub const SCRATCH_TOP_ALLOCATOR_OFFSET: u64 = 0x10; | ||
| pub const SCRATCH_TOP_SNAPSHOT_PT_GPA_BASE_OFFSET: u64 = 0x18; | ||
| pub const SCRATCH_TOP_SNAPSHOT_GENERATION_OFFSET: u64 = 0x20; | ||
| pub const SCRATCH_TOP_EXN_STACK_OFFSET: u64 = 0x30; | ||
|
|
||
|
Member
What happened to
||
| /// Offset from the top of scratch for the `clock_type` field (u64). | ||
| /// | ||
| /// Identifies which paravirtualized clock the host configured | ||
| /// ([`crate::time::ClockType`]). Lives in the bookkeeping page at the | ||
| /// top of scratch — NOT in the clock page itself — so the hypervisor | ||
| /// cannot clobber it if it extends the TLFS-reserved region. | ||
| pub const SCRATCH_TOP_CLOCK_TYPE_OFFSET: u64 = 0x28; | ||
|
Member
I'm just curious: is top scratch now the go-to mechanism for sharing configuration between host and guest, replacing the PEB?
||
|
|
||
| /// Offset from the top of scratch for the `boot_time_ns` field (u64). | ||
| /// | ||
| /// The Unix-epoch origin of the monotonic clock, computed by the host | ||
| /// as `SystemTime::now() - current_monotonic_ns()` and written in | ||
| /// `arm_clock`. The guest recovers wall time as | ||
| /// `boot_time_ns + monotonic_time_ns()`. | ||
| /// | ||
| /// Hyper-V has no equivalent to KVM's `MSR_KVM_WALL_CLOCK_NEW`, so | ||
| /// we use this uniform host-computed approach on all backends. | ||
| pub const SCRATCH_TOP_BOOT_TIME_NS_OFFSET: u64 = 0x30; | ||
|
|
||
| // ---- Next free offset in the bookkeeping page: 0x38 ---- | ||
| // When adding new host→guest shared fields, use the next multiple of | ||
| // 8 after the last offset above. All fields in this page are u64, | ||
| // little-endian, host-written and guest-read, and are excluded from | ||
| // snapshots because they live in scratch memory. | ||
|
|
||
| /// Offset from the top of scratch memory for a shared host-guest u64 counter. | ||
| /// | ||
|
|
@@ -49,12 +73,60 @@ pub const SCRATCH_TOP_EXN_STACK_OFFSET: u64 = 0x30; | |
| #[cfg(feature = "guest-counter")] | ||
| pub const SCRATCH_TOP_GUEST_COUNTER_OFFSET: u64 = 0x1008; | ||
|
|
||
| /// Offset from the top of scratch memory for the start of the paravirtualized | ||
| /// clock page. | ||
| /// | ||
| /// The clock page is a single 4 KiB page occupying the scratch offsets | ||
| /// `[0x3000, 0x2000)` from the top — i.e. one page lower than the | ||
| /// guest-counter page, to avoid the i686 frame-number issue that forces the | ||
| /// counter off the very last page (see [`SCRATCH_TOP_GUEST_COUNTER_OFFSET`]). | ||
| /// | ||
| /// The constant is the *high* (exclusive) offset; the page base is one page | ||
| /// below, at `top - SCRATCH_TOP_CLOCK_PAGE_OFFSET` + 1 byte — in other words, | ||
| /// subtract this value from `MAX_GPA`/`MAX_GVA` + 1 to get the page base. | ||
| /// | ||
| /// The page is always reserved regardless of the `enable_guest_clock` | ||
| /// feature so that the memory layout (and therefore stack positions) | ||
| /// is stable across feature-flag builds. The host only populates it | ||
| /// when the feature is enabled; otherwise it stays zero-filled and | ||
| /// the guest sees `ClockType::None`. | ||
| pub const SCRATCH_TOP_CLOCK_PAGE_OFFSET: u64 = 0x3000; | ||
|
Member
Could we add static assertions that the address is properly aligned to store
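One way such compile-time checks could look is sketched below. The constant values are copied from the diff; everything else is an assumption for illustration: `PAGE_SIZE`, the helper `clock_page_base`, and the premise that the top of scratch (`MAX_GPA + 1`) is itself page-aligned, so that an offset's alignment carries over to the resulting address.

```rust
// Values copied from the diff under review.
pub const SCRATCH_TOP_CLOCK_TYPE_OFFSET: u64 = 0x28;
pub const SCRATCH_TOP_BOOT_TIME_NS_OFFSET: u64 = 0x30;
pub const SCRATCH_TOP_CLOCK_PAGE_OFFSET: u64 = 0x3000;
pub const CLOCK_PAGE_SIZE: u64 = 0x1000;

// Assumed page size; not taken from the PR.
const PAGE_SIZE: u64 = 0x1000;

// u64 fields must sit at 8-byte-aligned offsets from the (page-aligned) top.
const _: () = assert!(SCRATCH_TOP_CLOCK_TYPE_OFFSET % 8 == 0);
const _: () = assert!(SCRATCH_TOP_BOOT_TIME_NS_OFFSET % 8 == 0);

// The two metadata fields must not overlap each other.
const _: () = assert!(SCRATCH_TOP_CLOCK_TYPE_OFFSET + 8 <= SCRATCH_TOP_BOOT_TIME_NS_OFFSET);

// The metadata must stay inside the top (bookkeeping) page.
const _: () = assert!(SCRATCH_TOP_BOOT_TIME_NS_OFFSET <= PAGE_SIZE);

// The clock page must be exactly one page, at a page-aligned offset.
const _: () = assert!(CLOCK_PAGE_SIZE == PAGE_SIZE);
const _: () = assert!(SCRATCH_TOP_CLOCK_PAGE_OFFSET % PAGE_SIZE == 0);

/// Hypothetical helper: base address of the clock page given the exclusive
/// top of scratch (i.e. `MAX_GPA + 1` or `MAX_GVA + 1`).
pub const fn clock_page_base(top_exclusive: u64) -> u64 {
    top_exclusive - SCRATCH_TOP_CLOCK_PAGE_OFFSET
}
```

`assert!` in a `const` item is evaluated at compile time (stable since Rust 1.57), so a misaligned constant would fail the build rather than surface at runtime.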
||
|
|
||
| /// Size of the paravirtualized clock page in bytes (one 4 KiB page). | ||
| /// The entire page is owned by the hypervisor (KVM pvclock or Hyper-V | ||
| /// Reference TSC). Hyperlight's own metadata (`clock_type`, | ||
| /// `boot_time_ns`) lives in the bookkeeping page at offsets | ||
| /// `SCRATCH_TOP_CLOCK_TYPE_OFFSET` / `SCRATCH_TOP_BOOT_TIME_NS_OFFSET`, | ||
| /// NOT in the clock page, so a future TLFS extension cannot clobber it. | ||
| pub const CLOCK_PAGE_SIZE: u64 = 0x1000; | ||
|
|
||
| pub fn scratch_base_gpa(size: usize) -> u64 { | ||
| (MAX_GPA - size + 1) as u64 | ||
| } | ||
| pub fn scratch_base_gva(size: usize) -> u64 { | ||
| (MAX_GVA - size + 1) as u64 | ||
| } | ||
|
|
||
| /// Guest physical address of the base of the paravirtualized clock page. | ||
| /// | ||
| /// The clock page sits at a fixed offset from the top of the guest physical | ||
| /// address space, independent of `scratch_size`: it is always | ||
| /// `MAX_GPA + 1 - SCRATCH_TOP_CLOCK_PAGE_OFFSET`. | ||
| /// | ||
| /// Only meaningful when the host is built with the `enable_guest_clock` | ||
| /// feature; otherwise the page is not populated. | ||
| pub const fn clock_page_gpa() -> u64 { | ||
| (MAX_GPA as u64) + 1 - SCRATCH_TOP_CLOCK_PAGE_OFFSET | ||
| } | ||
|
|
||
| /// Guest virtual address of the base of the paravirtualized clock page. | ||
| /// | ||
| /// See [`clock_page_gpa`]. Scratch is mapped identity-style from | ||
| /// `scratch_base_gva` to `scratch_base_gpa`, so the clock page sits at the | ||
| /// equivalent offset in the guest virtual address space. | ||
| pub const fn clock_page_gva() -> u64 { | ||
| (MAX_GVA as u64) + 1 - SCRATCH_TOP_CLOCK_PAGE_OFFSET | ||
| } | ||
|
|
||
| /// Compute the minimum scratch region size needed for a sandbox. | ||
| pub use arch::min_scratch_size; | ||
|
Member
I don't see the arch specific
||
It is also a good stopgap for many other things that expect
`gettimeofday`/`clock_gettime` to work (like StarlingMonkey and QuickJS)