Fix guest tracing memory leak when halt function is called #1032

@dblnz

Description

What happened?

After a relatively large number of iterations of the fuzz test, the call to the guest function hangs and never returns.

What did you expect to happen?

The fuzz test should never cause a guest function call to hang or crash.

Steps to reproduce the behavior

This issue was observed while running fuzz tests that call a recursive guest function that creates spans and log events.
The test calls a guest function that makes up to 255 recursive calls; when the halt function is called to signal completion of the function call, the serialized trace data is leaked.

Hyperlight Version

0.11.0

OS version

On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

$ uname -a
Linux laptop 6.6.87.2-microsoft-standard-WSL2 #1 SMP PREEMPT_DYNAMIC Thu Jun  5 18:30:46 UTC 2025 x86_64 GNU/Linux

On Windows:
C:\> cmd /c ver
Not tested

Additional Information

After further debugging, I found two issues behind this outcome:

  • When the guest calls the halt function, it wants to report the tracing data to the host, so it calls the hyperlight-guest-tracing::guest_trace_info function to serialize the events captured up to that point.
    This function returns a Vec<u8>, but after the hlt instruction executes, the hypervisor never yields control back to the guest, so the allocated memory is never dropped, which causes a leak.
  • The hang itself is a deadlock on the trace-data lock: at some point a heap allocation fails, which raises an exception; the exception handler calls the outb function to send the same trace data to the host, which in turn tries to acquire the lock that is already held.

Solution:

  • Change the logic to serialize events as they are created, keeping only a buffer holding a chunk of serialized events. This removes the need to serialize (and therefore allocate) at the moment we want to send to the host:
    we can send the buffer directly.
  • Never lock() the trace-data Mutex; use try_lock() instead. Where there is no sensible fallback, failing to acquire the lock means something terrible has happened, so we can panic; everywhere else we simply skip reporting the trace data.

Metadata

Labels

area/security: Involves security-related changes or fixes
