
[PAL/vm-common] Get memory regions from VMM using "etc/e820" fw_cfg file #28

Open · wants to merge 97 commits into base: intel_tdx from dimakuv/parse-etc-e820-fwcfg-file

Conversation

Contributor

@dimakuv dimakuv commented May 7, 2024

Description of the changes

Previously on the VM PAL, we hard-coded a single memory region that is RAM (i.e., not reserved) via FW_CFG_RAM_SIZE. This is a dangerous assumption which may break in the future. This commit uses a proper source of E820 info: the etc/e820 fw_cfg file.

As a side effect, the find_fw_cfg_selector() function was modified to also return the size of the fw_cfg object. All callers of this function now use this returned size.

Fixes #27.

How to test this PR?

Run on newer QEMU.



dimakuv and others added 30 commits April 23, 2024 06:22
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
In case of VM and TDX PALs, Gramine actually expects apps to issue raw
`syscall` instructions (because Gramine executes in ring-0).

TODO: Same for musl, libgomp and any other syscall-patched subproject.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, all PALs performed relocations on the PAL binary. However,
some PALs like TDX are loaded with all relocations already done, so
there is no need to perform them again.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, the stack started (to grow downwards) from the base address
of the PAL binary (which is at 1.5MB or 0x180000). This stack space was
not accounted for and could lead to memory corruptions (if e.g. the PAL
would decide to allocate some objects there via _PalVirtualMemoryAlloc).
Now the stack is part of the PAL binary and has a size of 16KB.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, VM/TDX PALs had all memory pages present (P bit = 1) all the
time, and they did a memset-to-zero of memory regions as soon as they
were requested in `PalVirtualMemoryAlloc()`. This commit changes it to
lazy memory allocation for anonymous memory: pages are marked as not
present (P bit = 0) and memset-to-zero on demand, in the #PF exception
handler. This is purely a performance optimization. Same for file-backed
memory.

As a side-effect of this change, the legacy "kernel stack pointer" in
TSS is replaced with the newer "interrupt stack table" entry. This is
because the "kernel stack pointer" changes to the dedicated stack
`_sys_interrupt_stack` only on ring 3 -> ring 0 transitions, so if e.g.
a #PF would happen while in ring 0, the interrupt would be processed on
the same stack, and the pages of this "normal" stack could be not
present in PTEs (leading to a double-fault: #PF during #PF handling). In
contrast, IST entries switch the stack unconditionally.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, the XCR0 control register was hard-coded to enable only x87,
SSE and AVX features inside a VM. This led to suboptimal performance of
workloads because the VM couldn't use e.g. AVX512 instructions. This
commit fixes it by setting XCR0 dynamically at startup, based on the
CPUID.(EAX=0DH,ECX=0) leaf.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, virtio drivers treated the host VMM as malicious if it
returned a `used_idx` in some virtio queue that was less than the
VM-memoized last-seen `used_idx`. However, this case is legitimate on
integer wraparound. Thus, this commit modifies the check to verify that
the host-returned index is at most queue-size items ahead of the
VM-memoized index, accounting for integer wraparound.

Note that this check concerns only the console and vsock drivers. The
fs driver is not problematic because it always assumes an increment of
1 in the virtio RX queue (this increment accounts for possible integer
wraparound).

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
The VIRTIO v1.1 spec states that bits 2 to 31 are reserved; see Section
4.1.4.5 "ISR status capability". Let's proactively verify this.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Inputs come from the untrusted host VMM, via port I/O.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
This attack is only possible on the virtio-vsock TQ because this is the
only place where our VM reads the to-be-freed descriptor index directly
from the possibly malicious host VMM.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
This commit refactors code for readability. It also adds more fail-fast
checks on allowed PCI devices and their configurations. Buffer overflows
and out-of-PCI-memory-region allocations are prevented.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, all threads waiting on some event waited on a single global
`g_waiting_events_futex` word. This meant that when some event was
triggered, *all* waiting threads were woken up. This commit replaces
the global variable with per-handle words, so only a specific thread is
woken up on the event.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Among the most important refactorings:
- Two pipes shared a single buffer, which was a part of one of the two
  pipes. Now replaced with a single object that both pipes reference
  (with refcount = 2).
- Pipes read and wrote one byte at a time in a for loop, which was very
  inefficient (but easily handled buffer wraparound). Now the loop is
  replaced with a while loop using memcpy.
- All possible data races are protected with locks: all
  connection-related events and the `pipe_buf` field are protected with
  `g_connecting_pipes_lock`, and all `pipe_buf` object accesses are
  protected with `pipe_buf::lock`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, Gramine-VM and Gramine-TDX did not have a correct UNIX time
on startup. They simply started with a timestamp of `0` (which in
UNIX-time format means January 1, 1970).

This commit gets the UNIX time value on startup from the VMM using the
fw_cfg feature of QEMU. The obtained time value is untrusted, so checks
are added to guarantee it is in a reasonable range. In the future, it
should be obtained from a trusted time source, such as a trusted remote
server.

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
Previously, wait-with-timeout functions registered a timeout (in a
global list) using `register_timeout()`. This function allocated a
timeout object and added it to the list of pending timeouts. Another
function, `notify_about_timeouts_uninterruptable()`, would remove the
triggered timeout from the list and free the timeout object itself.

However, if the wait-with-timeout function finishes before the timeout
is triggered, then the timeout would become "dangling". In reality, when
the wait-with-timeout function finishes, the timeout must be removed
from the list and freed (i.e. by the function itself).

This commit forces wait-with-timeout functions to take ownership of the
timeout object, returned by `register_timeout()` and deregister this
timeout upon finishing via `deregister_timeout()`. As a side effect, the
two users of timeout -- `PalEventWait()` and `PalStreamsWaitEvents()` --
are refactored to have uniform timeout-handling code.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, Gramine-TDX used the `.manifest` file (same as Gramine-VM).
However, for SGX-like protections such as trusted & allowed files,
Gramine-TDX must use `.manifest.tdx`. But since Gramine's current build
infrastructure creates `.manifest.sgx` files, fall back to using this
file if `.manifest.tdx` wasn't found.

This fall-back to `.manifest.sgx` is a temporary measure, until Gramine
has proper build steps for TDX.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
The logic is copy-pasted from Linux-SGX PAL and modified:
- no optimization of umem,
- no `pal_handle::file::seekable` field,
- `total` is renamed to `file_size`,
- `off_t` is replaced with `int64_t`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Now both `gramine-vm` and `gramine-tdx` are generated from a single
`gramine-vm.in` bash script template.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
`pal_socket_addr::port` is big-endian whereas `sockaddr_vm::port` is in
host byte order (little-endian on x86-64), so a conversion is needed.
Same for `pal_socket_addr::addr`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, RTMR2 was erroneously extended with `loader.entrypoint`
(i.e. the LibOS binary). This was (a) insecure, because we read this
file untrusted and unchecked, so a malicious host could replace it
later, and (b) irrelevant, because the manifest file -- the actual root
of trust -- was not reflected anywhere in TDX measurements.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, Gramine-VM and Gramine-TDX were always started with 8GB of
RAM. In reality, users may want smaller or greater values.

This commit adds support for a configurable QEMU VM memory size. We
first try to get the value from the `.manifest.tdx` file, by parsing
the `tdx.enclave_size` value. If this fails, we alternatively try
`.manifest.sgx` and `sgx.enclave_size`. If file parsing fails, we then
try to get the value from the environment variable `GRAMINE_RAM_SIZE`.
If both file and envvar parsing fail, we eventually fall back to the
default value of 8GB.

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
…utdown

Previously, in `virtio_vsock_connect()` and `virtio_vsock_shutdown()`,
we busy-looped waiting for the connection to reach the expected state.

This commit sleeps on a timeout instead of busy-looping.

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
Otherwise, shutdown and close callbacks would print annoying messages.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Currently Gramine-VM/TDX is single-core, so we always return a single
(first) CPU core in the CPU mask in response to a CPU get-affinity
request.

The previous logic returned success without updating the CPU mask. This
led to applications getting an empty CPU mask, which is unexpected and
e.g. resulted in Blender raising a division-by-zero exception.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
All `virtio_fs_fuse_*` functions were analyzed and hardened where
required. Also, several bugs were fixed in VM/TDX PAL FS code.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, VM/TDX PALs hard-coded page tables to start at 146MB. If
Gramine app was an EXEC ELF binary, then this binary would start at 4MB
and thus could be no larger than 142MB. However, some apps like Blender
exceed this limit, and Gramine failed to map such binaries.

To fix this and allow for bigger binaries, this commit moves hard-coded
memory regions higher: page tables start at 512MB, shared virtq memory
starts at 648MB. To achieve this, we had to modify page table
initialization for VM PAL to become three-phased: first the bootloader
installs static page tables to cover 32MB of memory, then we install
temporary page tables to cover 1GB of memory, and finally we install
proper page tables to cover all memory available to the VM. Note that
this three-phased init is not needed for TDX PAL: TD-Shim already
installs initial page tables that cover [0..4GB) of memory.

This commit also moves the PAL binary to start at 1MB in VM case.
Previously it started at 1.5MB, for which there was no reason. Also, the
PAL binary in the VM case became smaller because its static page tables
are much smaller (reduced from 128 PTs to 16 PTs).

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
dimakuv and others added 25 commits April 24, 2024 03:46
Previously, the file mmap callback did not call `memory_alloc()` to
request the memory region into which the file contents would be copied.
This bug went unnoticed because all pages are currently marked as
Present and RWX, so `memory_alloc()` is currently a no-op (other than
zeroing out the memory).

This bug was uncovered when implementing proper page permissions for
VM-based PALs. This commit adds a required call to `memory_alloc()`.
Note that this leads to a slight performance overhead, as we now
redundantly memset the newly-allocated memory region to all-zeros (and
then immediately overwrite with file contents). However, the file-mmap
operation is sufficiently rare, and correct memory management is more
important than a small perf overhead.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, fpregs memory region (aka XSAVE region) was allocated
in a separate `malloc()` operation when a thread was created, and it was
deallocated via `free()` during thread exit. The deallocation via
`free()` actually went through the `g_mem_bkeep_free_upcall()` LibOS
upcall path, which could go back into PAL and ultimately call
`sched_thread_wait()`. At this point, the thread may *not* have a valid
fpregs region anymore, but the thread will still be de-scheduled and
will try to save fpregs into this non-existing region, leading to an
unrecoverable page fault.

This bug was uncovered when implementing proper page permissions for
VM-based PALs. This commit fixes it by allocating fpregs memory region
together with the thread stack, so that it is never freed but instead
reused by future threads. This has a side benefit of making the thread
creation/exit code cleaner.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, when trying to send a normal packet on a full vsock TX
queue, we bailed out immediately and dropped the packet. This commit
adds a retry: if the queue is full, we drain it first, then try to send
the packet again, and only after the second failed attempt do we bail
out (and print a warning that the packet was dropped).

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, VM-based PALs marked all pages as Present, RWX and
accessible from both userspace and kernelspace (see `pagetables_init()`
in common VM code). This commit introduces proper page permissions:

- After boot time, special memory regions like Shared Memory and Page
  Tables are marked as kernelspace-only and non-execute. All unused
  memory regions are marked as not present (not accessible).
- At run time, memory management routines take page permissions into
  account.

In the multicore case, the affected memory region must be "TLB
invalidated" on all cores (vCPUs). To this end, inter-processor
interrupts (IPIs) are used: a vCPU that wishes to change a page mapping
writes the operation to a global struct and sends an interrupt 33 to
every other vCPU. Each vCPU receives the interrupt, acknowledges the IPI
by writing to a shared variable and invalidates the TLB entries. After
all vCPUs have acknowledged the IPI, the initiating vCPU finalizes the
"invalidate TLB" protocol.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, we hard-coded the `guest-cid` (CID) in virtio-vsock to `10`
and the UNIX socket path for `vhostfs` to `/tmp/vhostfs`. This made
running two instances of `gramine-vm` or `gramine-tdx` at the same time
impossible, with errors like these:
- from `virtio-vsock`: `unable to set guest cid: Address already in use`
- from `virtiofsd`: `Error creating pid file '/tmp/vhostfs.pid'`

This commit introduces the `GRAMINE_VM_ID` variable used to configure
the `guest-cid` and the path for `virtiofs`. `GRAMINE_VM_ID` is
determined based on the already used `guest-cid`s. We scan the running
QEMU processes and identify the `guest-cid`s of the running VMs.
Starting from our default value (`10`), we search for the next available
`guest-cid`. In this way, we avoid collisions between the instances.

Signed-off-by: Dimitrios Stavrakakis <dimitrios.stavrakakis@intel.com>
Previously, we immediately set whatever permission the user requested
on a file-backed mmap. This is wrong: the subsequent code updates the
mmapped region with file contents, so the region must first be mapped
with write permission, and only after the update is the write
permission revoked (if the user requested read-only).

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, the `shutdown()` operation (i.e., `VIRTIO_VSOCK_OP_SHUTDOWN`
vsock operation) was incorrectly emulated as the `close()` operation. In
particular, our code didn't take into account the RCV/SEND/COMPLETE flag
of the SHUTDOWN operation. Also, the SHUTDOWN operation moved the
connection state to "closed", which prevented any further operations
like read/write on this connection, even if the shutdown was only
partial. Also, a benign RST reply on the SHUTDOWN operation wasn't
ignored (as it should be), but instead moved the connection state to
"closed".

This commit fixes all the above issues. Now we keep the connection
opened in case of a partial shutdown, and we don't panic on benign RST
packets. This fixes workloads like lighttpd + wrk.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, our virtio drivers didn't use suppression of notifications
at all. This commit adds the following suppression mechanisms:
- Receiving inputs on RQ of virtio-console is wrapped into "disable
  interrupts" loop.
- virtio-console sends notification on consumed inputs on RQ only if
  the device didn't set `VIRTQ_USED_F_NO_NOTIFY` flag.
- virtio-console sends notification on prepared outputs on TQ only if
  the device didn't set `VIRTQ_USED_F_NO_NOTIFY` flag.
- virtio-console disables interrupts on TQ unconditionally, because it
  performs its own clean-up of TQ.
- virtio-vsock has exactly the same suppressions as in virtio-console.
- virtio-fs disables interrupts on the `requests` queue unconditionally
  and instead always performs busy poll.

Notice that we do not use the VIRTIO_F_RING_EVENT_IDX feature, nor the
MSI-X capability. We use the classic VRING_AVAIL_F_NO_INTERRUPT and
VIRTQ_USED_F_NO_NOTIFY bits in the `avail->flags` and `used->flags` of
the virtio queues, respectively.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, VM and TDX PALs tried to use virtio-console, which may not
yet be initialized during early boot. For this early-boot phase, use
port I/O, which is always available.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
These are tiny fixes, as Clang is stricter about suspicious programming
practices than GCC.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Helper threads don't use signal alt stack, so `thread_helper_create()`
didn't allocate space for it. However, a generic `thread_setup()`
expects this alt stack and memsets it to zero. This commit fixes this
buffer overflow by allocating the alt stack for helper threads.

This bug was found by Address Sanitizer.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
…f()`

This commit replaces calls to internal `memory_mark_pages_off()` with
calls to a proper API function `memory_protect()`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
With a previous (bogus) declaration of the `idt_start` assembly-defined
variable, newer GCC complains about "array subscript XX is outside array
bounds of char[1] (-Warray-bounds)".

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Functions declared in `kernel_debug.h` were ad-hoc and badly named
(clashing with Gramine's own `log_*` functions). Rename to avoid
confusion and refactor slightly.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, there was a bug in `pagetables_init()`: PDP entries could
be "forgotten" for VMs with sizes not aligned to 1GB. For example, a VM
with 1.5GB would generate two Page Directory Tables, but there would be
only one PDP entry, pointing to the first of these two PDTs.

This commit fixes this: we use the ceiling function instead of floor
function during calculation of page tables' entries.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
The transmit queue can become full of pending messages, each in its own
slot. The current queue size is 128, so if the app sends 128 (small)
messages in quick succession, the queue becomes full and messages get
dropped. This commit fixes this by draining the queue in that case.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
A wrapper function is required to support callback functions that
perform `return 0` instead of `PalThreadExit()` at the end of thread
execution. Without this wrapper, the thread would jump off the stack
upon `return 0`. This was found on PAL regression test `Thread2`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Previously, the common VM code had a bug of sending a shutdown packet
even on a listening socket during `close(listening-socket)`. This is
incorrect: listening sockets don't have a connection, so there is
nothing to shut down; they are just immediately destroyed.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
…ailure

Previously the pipes code incorrectly returned `-PAL_ERROR_DENIED`.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Also fix `IPV6_V6ONLY` support, as it missed some corner cases.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Current futex triggering has a data race: if the thread that is
supposed to wait on poll/epoll (emulated in PAL via
`PalStreamsWaitEvents()`) is currently running, and another thread (on
another CPU) triggers the polled-on futex, this event is lost. This is
because all polls wait on a single global
`g_streams_waiting_events_futex`, so if two threads (e.g. the Async and
IPC threads) poll at the same time, only one of them notices the
wake-up; the futex is then considered triggered, and the event is not
reflected in the other (currently running) thread on its next poll
wait. This may lead to hangs, e.g., in the IPC thread, which waits
without any timeout.

We work around this peculiarity of the current
`PalStreamsWaitEvents()` implementation by periodically triggering this
global futex, in particular during timer interrupts on CPU0. A proper
fix would be to introduce a triggering mechanism that uses fine-grained
futexes.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
If Gramine-VM or -TDX is built with Address Sanitizer (ASan), we reserve
1/8th of the total VM address space (which includes the memory and PCI
holes at [2GB, 4GB)) for ASan shadow memory. We carve out this 1/8th of
the space at the very end of the usable VM address space:
- If VM has 8GB of RAM, then ASan shadow memory needs 1GB: [7G, 8G).
- The current max amount of RAM for a VM is 64GB, which corresponds to
  8GB of ASan shadow memory.

We force VM size to be at least 8GB for simplicity: to avoid overlapping
with reserved and MMIO memory regions (e.g., shared memory region, page
tables region, PCI MMIO region). Otherwise, for example, if the VM size
were 4GB, then ASan shadow memory could not be put at [3.5GB, 4GB), as
it overlaps with the PCI MMIO region.

In addition, ASan shadow memory must be reflected in ASan-specific page
tables. These page tables do not have 1:1 mapping, as ASan shadow memory
has virtual addresses [1.5TB, 1.5TB + 8GB). Thus, these page tables must
create virtual-to-physical mapping as:

    1.5TB + offset -> ASAN_SHADOW_PHYSICAL_START + offset

Also, the "normal" page tables must disable "normal" virtual mappings to
these physical-shadow memory pages (otherwise normal memory would alias
ASan shadow memory).

Page tables' memory region is slightly increased from [512MB, 648MB) to
[512MB, 658MB), to be able to fit additional page tables for ASan shadow
memory.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
The host uses `fwd_cnt` and `buf_alloc` obtained in the guest's vsock
packets for its buffer management (deciding whether to send the next
packet to the guest or to back off). We previously incorrectly
implemented this
by making the two variables global for all virtual sockets. In reality,
these variables are per-connection (per-socket). This mismatch between
what the host expects and what Gramine guest implements led to hangs
because on new connections, Gramine reported too many "bytes received"
in `fwd_cnt` counter, leading the host to believe there are too many
packets in flight on the new connection and to back off constantly.

Fix by moving the vars to fields of `struct virtio_vsock_connection`. An
additional benefit is that we already have proper locking for
per-connection fields, so there is no need for atomics or special
synchronization.

Interestingly, Linux prior to v6.7 did not expose this Gramine bug
because it itself had a bug of integer wrap around. This was fixed with
commit torvalds/linux@60316d7f10b17a in v6.7.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
Contributor Author

@dimakuv dimakuv left a comment


Reviewable status: 0 of 5 files reviewed, 1 unresolved discussion

a discussion (no related file):
Blocking: must test and understand all reserved ranges on new QEMU/on different environments.


Previously on the VM PAL, we hard-coded a single memory region that is
RAM (i.e., not reserved) via `FW_CFG_RAM_SIZE`. This is a dangerous
assumption which may break in the future. This commit uses a proper
source of E820 info: the `etc/e820` fw_cfg file.

As a side effect, the `find_fw_cfg_selector()` function was modified
to also return the size of the fw_cfg object. All callers of this
function now use this returned size.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
@dimakuv dimakuv force-pushed the dimakuv/parse-etc-e820-fwcfg-file branch from a6cba2d to fb38207 Compare May 8, 2024 13:43
Contributor Author

@dimakuv dimakuv left a comment


Reviewable status: 0 of 5 files reviewed, all discussions resolved

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Blocking: must test and understand all reserved ranges on new QEMU/on different environments.

Update: the problem was not related to new reserved memory ranges on newer QEMU (v8.0).

The root cause was that `docker run ...` by default limits /dev/shm to 64MB. As soon as I set this limit to the one used by Gramine VMs, e.g. `docker run --shm-size 4G`, it started working.


Development

Successfully merging this pull request may close these issues.

[PAL/vm] Use etc/e820 fw_cfg file to learn reserved memory ranges
2 participants