
[PAL/vm-common] Get memory regions from VMM using "etc/e820" fw_cfg file #28

Open · wants to merge 97 commits into base: intel_tdx from dimakuv/parse-etc-e820-fwcfg-file

Conversation

Contributor

@dimakuv dimakuv commented May 7, 2024

Description of the changes

Previously on the VM PAL, we hard-coded a single memory region that is RAM (i.e., not reserved) via FW_CFG_RAM_SIZE. This is a dangerous assumption which may break in the future. This commit uses a proper source of E820 info: the etc/e820 fw_cfg file.

As a side effect, the find_fw_cfg_selector() function was modified to also return the size of the fw_cfg object. All callers of this function now use this returned size.

Fixes #27.

How to test this PR?

Run on newer QEMU.



dimakuv and others added 30 commits April 23, 2024 06:22
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
In case of VM and TDX PALs, Gramine actually expects apps to issue raw
`syscall` instructions (because Gramine executes in ring-0).

TODO: Same for musl, libgomp and any other syscall-patched subproject.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, all PALs performed relocations on the PAL binary. However,
some PALs like TDX are loaded with all relocations already done, so
there is no need to perform them again.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, the stack started (to grow downwards) from the base address
of the PAL binary (which is at 1.5MB or 0x180000). This stack space was
not accounted for and could lead to memory corruptions (if e.g. the PAL
would decide to allocate some objects there via _PalVirtualMemoryAlloc).
Now the stack is part of the PAL binary and has a size of 16KB.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, VM/TDX PALs had all memory pages present (P bit = 1) all the
time, and they did a memset-to-zero of memory regions as soon as they
were requested in `PalVirtualMemoryAlloc()`. This commit changes it to
lazy memory allocation for anonymous memory: pages are marked as not
present (P bit = 0) and memset-to-zero on demand, in the #PF exception
handler. This is purely a performance optimization. Same for file-backed
memory.

As a side-effect of this change, the legacy "kernel stack pointer" in
TSS is replaced with the newer "interrupt stack table" entry. This is
because the "kernel stack pointer" changes to the dedicated stack
`_sys_interrupt_stack` only on ring 3 -> ring 0 transitions, so if e.g.
a #PF would happen while in ring 0, the interrupt would be processed on
the same stack, and the pages of this "normal" stack could be not
present in PTEs (leading to a double-fault: #PF during #PF handling). In
contrast, IST entries switch the stack unconditionally.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, the XCR0 control register was hard-coded to enable only x87,
SSE and AVX features inside a VM. This led to suboptimal performance of
workloads because the VM couldn't use e.g. AVX512 instructions. This
commit fixes it by setting XCR0 dynamically at startup, based on the
CPUID.(EAX=0DH,ECX=0) leaf.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, virtio drivers treated the host VMM as malicious if it
returned a `used_idx` in some virtio queue that was less than the
VM-memoized last-seen `used_idx`. However, this case is legitimate on
integer wraparound. Thus, this commit modifies the check to verify that
the host-returned index is at most queue-size items ahead of the
VM-memoized index, accounting for integer wraparound.

Note that this check concerns only the console and vsock drivers. The
fs driver is not problematic because it always assumes an increment of
1 in the virtio RX queue (this increment accounts for possible integer
wraparound).

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
The VIRTIO v1.1 spec states that bits 2 to 31 are reserved; see Section
4.1.4.5 "ISR status capability". Let's proactively verify this.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Inputs come from the untrusted host VMM, via port I/O.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
This attack is only possible on the virtio-vsock TQ because this is the
only place where our VM reads the to-be-freed descriptor index directly
from the possibly malicious host VMM.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
This commit refactors code for readability. It also adds more fail-fast
checks on allowed PCI devices and their configurations. Buffer overflows
and out-of-PCI-memory-region allocations are prevented.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, all threads waiting on some event waited on a single global
`g_waiting_events_futex` word. This meant that when some event was
triggered, *all* waiting threads were woken up. This commit replaces
the global variable with per-handle words, so only a specific thread is
woken up on the event.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Among the most important refactorings:
- Two pipes shared a single buffer, which was a part of one of the two
  pipes. Now replaced with a single object that both pipes reference
  (with refcount = 2).
- Pipes read and wrote one byte at a time in a for loop, which was very
  inefficient (but easily handled buffer wraparound). Now the loop is
  replaced with a while loop using memcpy.
- All possible data races are protected with locks: all
  connection-related events and the `pipe_buf` field are protected with
  `g_connecting_pipes_lock`, and all `pipe_buf` object accesses are
  protected with `pipe_buf::lock`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, Gramine-VM and Gramine-TDX did not have a correct UNIX time
on startup. They simply started with a timestamp of `0` (which in
UNIX-time format means January 1, 1970).

This commit gets the UNIX time value on startup from the VMM using the
fw_cfg feature of QEMU. The obtained time value is untrusted, so checks
are added to guarantee it is in a reasonable range. In the future, it
should be obtained from a trusted time source, such as a trusted remote
server.

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
Previously, wait-with-timeout functions registered a timeout (in a
global list) using `register_timeout()`. This function allocated a
timeout object and added it to the list of pending timeouts. Another
function, `notify_about_timeouts_uninterruptable()`, would remove the
triggered timeout from the list and free the timeout object itself.

However, if the wait-with-timeout function finishes before the timeout
is triggered, then the timeout would become "dangling". In reality, when
the wait-with-timeout function finishes, the timeout must be removed
from the list and freed (i.e. by the function itself).

This commit forces wait-with-timeout functions to take ownership of the
timeout object, returned by `register_timeout()` and deregister this
timeout upon finishing via `deregister_timeout()`. As a side effect, the
two users of timeout -- `PalEventWait()` and `PalStreamsWaitEvents()` --
are refactored to have uniform timeout-handling code.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, Gramine-TDX used the `.manifest` file (same as Gramine-VM).
However, for SGX-like protections such as trusted & allowed files,
Gramine-TDX must use `.manifest.tdx`. But since Gramine's current build
infrastructure creates `.manifest.sgx` files, fall back to using this
file if `.manifest.tdx` wasn't found.

This fall-back to `.manifest.sgx` is a temporary measure, until Gramine
has proper build steps for TDX.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
The logic is copy-pasted from Linux-SGX PAL and modified:
- no optimization of umem,
- no `pal_handle::file::seekable` field,
- `total` is renamed to `file_size`,
- `off_t` is replaced with `int64_t`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Now both `gramine-vm` and `gramine-tdx` are generated from a single
`gramine-vm.in` bash script template.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
`pal_socket_addr::port` is big-endian whereas `sockaddr_vm::port` is in
host byte order (little-endian on x86-64), so a conversion is needed.
Same for `pal_socket_addr::addr`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, RTMR2 was erroneously extended with `loader.entrypoint`
(i.e. the LibOS binary). This was (a) insecure, because we read this
file untrusted and unchecked, so a malicious host could replace it
later, and (b) irrelevant, because the manifest file -- the actual root
of trust -- was not reflected anywhere in TDX measurements.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, Gramine-VM and Gramine-TDX were always started with 8GB of
RAM. In reality, users may want smaller or greater values.

This commit adds support for a configurable QEMU VM memory size. We
first try to get the value from the `.manifest.tdx` file, by parsing
the `tdx.enclave_size` value. If this fails, we alternatively try
`.manifest.sgx` and `sgx.enclave_size`. If file parsing fails, we then
try to get the value from the environment variable `GRAMINE_RAM_SIZE`.
If both file and envvar parsing fail, we eventually fall back to the
default value of 8GB.

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
…utdown

Previously, in `virtio_vsock_connect()` and `virtio_vsock_shutdown()`,
we busy-looped waiting for the connection to reach the expected state.

This commit sleeps on a timeout instead of busy-looping.

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
Otherwise, shutdown and close callbacks would print annoying messages.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Currently Gramine-VM/TDX is single-core, so we always return a single
(first) CPU core in the CPU mask in response to a CPU get-affinity
request.

The previous logic returned success without updating the CPU mask. This
led to applications getting an empty CPU mask, which is unexpected and
e.g. resulted in Blender raising a division-by-zero exception.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
All `virtio_fs_fuse_*` functions were analyzed and hardened where
required. Also, several bugs were fixed in VM/TDX PAL FS code.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, VM/TDX PALs hard-coded page tables to start at 146MB. If
Gramine app was an EXEC ELF binary, then this binary would start at 4MB
and thus could be no larger than 142MB. However, some apps like Blender
exceed this limit, and Gramine failed to map such binaries.

To fix this and allow for bigger binaries, this commit moves hard-coded
memory regions higher: page tables start at 512MB, shared virtq memory
starts at 648MB. To achieve this, we had to modify page table
initialization for VM PAL to become three-phased: first the bootloader
installs static page tables to cover 32MB of memory, then we install
temporary page tables to cover 1GB of memory, and finally we install
proper page tables to cover all memory available to the VM. Note that
this three-phased init is not needed for TDX PAL: TD-Shim already
installs initial page tables that cover [0..4GB) of memory.

This commit also moves the PAL binary to start at 1MB in VM case.
Previously it started at 1.5MB, for which there was no reason. Also, the
PAL binary in the VM case became smaller because its static page tables
are much smaller (reduced from 128 PTs to 16 PTs).

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
dimakuv and others added 25 commits April 24, 2024 03:46
Previously, the file mmap callback did not call `memory_alloc()` to
request the memory region into which the file contents would be copied.
This bug went unnoticed because all pages are currently marked as
Present and RWX, so `memory_alloc()` is currently a no-op (other than
zeroing out the memory).

This bug was uncovered when implementing proper page permissions for
VM-based PALs. This commit adds a required call to `memory_alloc()`.
Note that this leads to a slight performance overhead, as we now
redundantly memset the newly-allocated memory region to all-zeros (and
then immediately overwrite with file contents). However, the file-mmap
operation is sufficiently rare, and correct memory management is more
important than a small perf overhead.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, fpregs memory region (aka XSAVE region) was allocated
in a separate `malloc()` operation when a thread was created, and it was
deallocated via `free()` during thread exit. The deallocation via
`free()` actually went through the `g_mem_bkeep_free_upcall()` LibOS
upcall path, which could go back into PAL and ultimately call
`sched_thread_wait()`. At this point, the thread may *not* have a valid
fpregs region anymore, but the thread will still be de-scheduled and
will try to save fpregs into this non-existing region, leading to an
unrecoverable page fault.

This bug was uncovered when implementing proper page permissions for
VM-based PALs. This commit fixes it by allocating fpregs memory region
together with the thread stack, so that it is never freed but instead
reused by future threads. This has a side benefit of making the thread
creation/exit code cleaner.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, when trying to send a normal packet on a full vsock TX
queue, we bailed out immediately and dropped the packet. This commit
adds a retry: if the queue is full, we drain it first, then try to send
the packet again, and only after the second failed attempt do we bail
out (and print a warning that the packet was dropped).

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, VM-based PALs marked all pages as Present, RWX and
accessible from both userspace and kernelspace (see `pagetables_init()`
in common VM code). This commit introduces proper page permissions:

- After boot time, special memory regions like Shared Memory and Page
  Tables are marked as kernelspace-only and non-execute. All unused
  memory regions are marked as not present (not accessible).
- At run time, memory management routines take page permissions into
  account.

In the multicore case, the affected memory region must be "TLB
invalidated" on all cores (vCPUs). To this end, inter-processor
interrupts (IPIs) are used: a vCPU that wishes to change a page mapping
writes the operation to a global struct and sends an interrupt 33 to
every other vCPU. Each vCPU receives the interrupt, acknowledges the IPI
by writing to a shared variable and invalidates the TLB entries. After
all vCPUs have acknowledged the IPI, the initiating vCPU finalizes the
"invalidate TLB" protocol.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, we hard-coded the `guest-cid` (CID) in virtio-vsock to `10`
and the UNIX socket path for `vhostfs` to `/tmp/vhostfs`. This made
running two instances of `gramine-vm` or `gramine-tdx` at the same time
impossible, with errors like these:
- from `virtio-vsock`: `unable to set guest cid: Address already in use`
- from `virtiofsd`: `Error creating pid file '/tmp/vhostfs.pid'`

This commit introduces the `GRAMINE_VM_ID` variable used to configure
the `guest-cid` and the path for `virtiofs`. `GRAMINE_VM_ID` is
determined based on the already used `guest-cid`s. We scan the running
QEMU processes and identify the `guest-cid`s of the running VMs.
Starting from our default value (`10`), we search for the next available
`guest-cid`. In this way, we avoid collisions between the instances.

Signed-off-by: Dimitrios Stavrakakis <dimitrios.stavrakakis@intel.com>
Previously, we immediately set whatever permission the user requested
on a file-backed mmap. This is wrong: the subsequent code updates the
mmapped region with file contents, so the region must first be mapped
with write permission, and only after the update is the write
permission revoked (if the user requested read-only).

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, the `shutdown()` operation (i.e., `VIRTIO_VSOCK_OP_SHUTDOWN`
vsock operation) was incorrectly emulated as the `close()` operation. In
particular, our code didn't take into account the RCV/SEND/COMPLETE flag
of the SHUTDOWN operation. Also, the SHUTDOWN operation moved the
connection state to "closed", which prevented any further operations
like read/write on this connection, even if the shutdown was only
partial. Also, a benign RST reply on the SHUTDOWN operation wasn't
ignored (as it should be), but instead moved the connection state to
"closed".

This commit fixes all the above issues. Now we keep the connection
opened in case of a partial shutdown, and we don't panic on benign RST
packets. This fixes workloads like lighttpd + wrk.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, our virtio drivers didn't use suppression of notifications
at all. This commit adds the following suppression mechanisms:
- Receiving inputs on RQ of virtio-console is wrapped into "disable
  interrupts" loop.
- virtio-console sends notification on consumed inputs on RQ only if
  the device didn't set `VIRTQ_USED_F_NO_NOTIFY` flag.
- virtio-console sends notification on prepared outputs on TQ only if
  the device didn't set `VIRTQ_USED_F_NO_NOTIFY` flag.
- virtio-console disables interrupts on TQ unconditionally, because it
  performs its own clean-up of TQ.
- virtio-vsock has exactly the same suppressions as in virtio-console.
- virtio-fs disables interrupts on the `requests` queue unconditionally
  and instead always performs busy poll.

Notice that we do not use the VIRTIO_F_RING_EVENT_IDX feature, nor the
MSI-X capability. We use the classic VRING_AVAIL_F_NO_INTERRUPT and
VIRTQ_USED_F_NO_NOTIFY bits in the `avail->flags` and `used->flags` of
the virtio queues, respectively.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, VM and TDX PALs tried to use virtio-console, which may not
yet be initialized during early boot. For this early-boot phase, use
port I/O, which is always available.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
These are tiny fixes, as Clang is stricter about suspicious programming
practices than GCC.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Helper threads don't use signal alt stack, so `thread_helper_create()`
didn't allocate space for it. However, a generic `thread_setup()`
expects this alt stack and memsets it to zero. This commit fixes this
buffer overflow by allocating the alt stack for helper threads.

This bug was found by Address Sanitizer.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
…f()`

This commit replaces calls to internal `memory_mark_pages_off()` with
calls to a proper API function `memory_protect()`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
With a previous (bogus) declaration of the `idt_start` assembly-defined
variable, newer GCC complains about "array subscript XX is outside array
bounds of char[1] (-Warray-bounds)".

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Functions declared in `kernel_debug.h` were ad-hoc and badly named
(clashing with Gramine's own `log_*` functions). Rename to avoid
confusion and refactor slightly.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, there was a bug in `pagetables_init()`: PDP entries could
be "forgotten" for VMs with sizes not aligned to 1GB. For example, a VM
with 1.5GB would generate two Page Directory Tables, but there would be
only one PDP entry, pointing to the first of these two PDTs.

This commit fixes this: we use the ceiling function instead of floor
function during calculation of page tables' entries.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
The transmit queue can become full of pending messages, each in its own
slot. The current queue size is 128, so if the app sends 128 (small)
messages in quick succession, the queue becomes full and messages get
dropped. This commit fixes this by draining the queue in that case.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
A wrapper function is required to support callback functions that
perform `return 0` instead of `PalThreadExit()` at the end of thread
execution. Without this wrapper, the thread would jump off the stack
upon `return 0`. This was found on PAL regression test `Thread2`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, the common VM code had a bug of sending a shutdown packet
even on a listening socket during `close(listening-socket)`. This is
incorrect: listening sockets don't have a connection, so there is
nothing to shut down; they are just immediately destroyed.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
…ailure

Previously the pipes code incorrectly returned `-PAL_ERROR_DENIED`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Also fix `IPV6_V6ONLY` support, as it missed some corner cases.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Current futex triggering has a data race: if the thread that is
supposed to wait on poll/epoll (emulated in PAL via
`PalStreamsWaitEvents()`) is currently running, and another thread (on
another CPU) triggers the polled-on futex, this event is lost. This is
because all polls wait on a single global
`g_streams_waiting_events_futex`, so if two threads (e.g. the Async and
IPC threads) poll at the same time, only one of them notices the
wake-up; the futex is then considered triggered, and the event is not
reflected in the other (currently running) thread on its next poll
wait. This may lead to hangs, e.g., in the IPC thread, which waits
without any timeout.

We work around this peculiarity of the current
`PalStreamsWaitEvents()` implementation by periodically triggering this
global futex, in particular during timer interrupts on CPU0. A proper
fix would be to introduce a triggering mechanism that uses fine-grained
futexes.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
If Gramine-VM or -TDX is built with Address Sanitizer (ASan), we reserve
1/8th of the total VM address space (which includes the memory and PCI
holes at [2GB, 4GB)) for ASan shadow memory. We carve out this 1/8th of
the space at the very end of the usable VM address space:
- If VM has 8GB of RAM, then ASan shadow memory needs 1GB: [7G, 8G).
- The current max amount of RAM for a VM is 64GB, which corresponds to
  8GB of ASan shadow memory.

We force VM size to be at least 8GB for simplicity: to avoid overlapping
with reserved and MMIO memory regions (e.g., shared memory region, page
tables region, PCI MMIO region). Otherwise, for example, if the VM size
were 4GB, then ASan shadow memory could not be put at [3.5GB, 4GB), as
it overlaps with the PCI MMIO region.

In addition, ASan shadow memory must be reflected in ASan-specific page
tables. These page tables do not have 1:1 mapping, as ASan shadow memory
has virtual addresses [1.5TB, 1.5TB + 8GB). Thus, these page tables must
create virtual-to-physical mapping as:

    1.5TB + offset -> ASAN_SHADOW_PHYSICAL_START + offset

Also, the "normal" page tables must disable "normal" virtual mappings to
these physical-shadow memory pages (otherwise normal memory would alias
ASan shadow memory).

Page tables' memory region is slightly increased from [512MB, 648MB) to
[512MB, 658MB), to be able to fit additional page tables for ASan shadow
memory.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
The host uses `fwd_cnt` and `buf_alloc` obtained in the guest's vsock
packets for its buffer management (deciding whether to send the next
packet to the guest or to back off). We previously incorrectly
implemented this
by making the two variables global for all virtual sockets. In reality,
these variables are per-connection (per-socket). This mismatch between
what the host expects and what Gramine guest implements led to hangs
because on new connections, Gramine reported too many "bytes received"
in `fwd_cnt` counter, leading the host to believe there are too many
packets in flight on the new connection and to back off constantly.

Fix by moving the vars to fields of `struct virtio_vsock_connection`. An
additional benefit is that we already have proper locking for
per-connection fields, so there is no need for atomics or special
synchronization.

Interestingly, Linux prior to v6.7 did not expose this Gramine bug
because it itself had a bug of integer wrap around. This was fixed with
commit torvalds/linux@60316d7f10b17a in v6.7.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Contributor Author

@dimakuv dimakuv left a comment


Reviewable status: 0 of 5 files reviewed, 1 unresolved discussion

a discussion (no related file):
Blocking: must test and understand all reserved ranges on new QEMU/on different environments.


Previously on the VM PAL, we hard-coded a single memory region that is
RAM (i.e., not reserved) via `FW_CFG_RAM_SIZE`. This is a dangerous
assumption which may break in the future. This commit uses a proper
source of E820 info: the `etc/e820` fw_cfg file.

As a side effect, the `find_fw_cfg_selector()` function was modified
to also return the size of the fw_cfg object. All callers of this
function now use this returned size.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
@dimakuv dimakuv force-pushed the dimakuv/parse-etc-e820-fwcfg-file branch from a6cba2d to fb38207 Compare May 8, 2024 13:43
Contributor Author

@dimakuv dimakuv left a comment


Reviewable status: 0 of 5 files reviewed, all discussions resolved

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Blocking: must test and understand all reserved ranges on new QEMU/on different environments.

Update: the problem was not related to new reserved memory ranges on newer QEMU (v8.0).

The root cause was that `docker run ...` by default limits /dev/shm to 64MB. As soon as I set this limit to the one used by Gramine VMs, e.g. `docker run --shm-size 4G`, it started working.


Development

Successfully merging this pull request may close these issues.

[PAL/vm] Use etc/e820 fw_cfg file to learn reserved memory ranges
2 participants