
bmalloc: add 1ms backoff and retry cap to SYSCALL/PAS_SYSCALL EAGAIN loops #169

Draft
coleleavitt wants to merge 2 commits into oven-sh:main from coleleavitt:fix/bmalloc-syscall-eagain-backoff

Conversation

@coleleavitt

Summary

The SYSCALL and PAS_SYSCALL macros in bmalloc retry syscalls on EAGAIN in a zero-delay tight loop. When madvise(MADV_DONTDUMP) returns EAGAIN due to kernel mmap_write_lock contention, this causes 100% CPU usage across all GC threads — effectively freezing the application.

This PR adds usleep(1000) backoff (1ms) and caps retries at 100 (100ms total).

Root Cause Analysis

LLM streaming / heavy allocation workload
  → JSC allocation-triggered GC fires
  → GC sweep calls bmalloc vmDeallocatePhysicalPages() thousands of times
  → Each call does TWO madvise syscalls: MADV_DONTNEED + MADV_DONTDUMP
  → MADV_DONTDUMP requires kernel mmap_write_lock (unlike MADV_DONTNEED which only needs read lock)
  → Multiple GC threads contend on single process-wide mmap_write_lock
  → Kernel returns EAGAIN (VMA split/merge allocation failure under memory pressure)
  → SYSCALL macro retries in ZERO-DELAY infinite loop: while((x)==-1 && errno==EAGAIN){}
  → 250K+ madvise calls/sec/thread, 100% CPU, application frozen

The Smoking Gun

BSyscall.h (before):

#define SYSCALL(x) do { \
    while ((x) == -1 && errno == EAGAIN) { } \
} while (0);

pas_utils.h (before):

#define PAS_SYSCALL(x) do { \
    while ((x) == -1 && errno == EAGAIN) { } \
} while (0)

Zero-delay infinite retry. No backoff, no sleep, no yield, no retry cap.

The Fix

// BSyscall.h
#define SYSCALL(x) do { \
    int _syscall_tries = 0; \
    while ((x) == -1 && errno == EAGAIN) { \
        if (++_syscall_tries > 100) break; \
        usleep(1000); \
    } \
} while (0);

// pas_utils.h
#define PAS_SYSCALL(x) do { \
    int _pas_syscall_tries = 0; \
    while ((x) == -1 && errno == EAGAIN) { \
        if (++_pas_syscall_tries > 100) break; \
        usleep(1000); \
    } \
} while (0)

Why This Approach

  • 1ms fixed delay is ~1000× longer than kernel lock hold time — more than enough for contention to clear
  • 100 retry cap (100ms total) prevents infinite loops under pathological conditions; madvise failures here are advisory, not fatal
  • Zero fast-path impact — the while body is dead code when the syscall succeeds
  • Matches the existing Windows precedent: virtual_alloc_with_retry() in libpas/pas_page_malloc.c already uses Sleep(50ms) with 10 max retries
  • Consistent with tcmalloc: Google's tcmalloc uses bounded retries (3 attempts) for expensive madvise operations

Why NOT sched_yield()

Per the Red Hat RHEL-RT Tuning Guide, sched_yield() may reschedule the caller immediately (degenerating into a busy loop) or only after a long delay; its behavior is unpredictable. usleep(1000) provides a deterministic 1ms backoff.

Blast Radius

17 callsites affected (all madvise/mprotect/mincore — all benefit from this fix):

  • 6 in bmalloc/VMAllocate.h (madvise calls in vmDeallocatePhysicalPages and vmAllocatePhysicalPages)
  • 9 in libpas/pas_page_malloc.c (madvise/mprotect in commit_impl and decommit_impl)
  • 2 in libpas/pas_committed_pages_vector.c (mincore in pas_committed_pages_vector_construct)

Upstream Status

Apple's upstream WebKit has the identical zero-delay SYSCALL macro and has not addressed this. This fix is novel.

Complementary Fix (Not in This PR)

MADV_DONTDUMP (the specific call that takes mmap_write_lock) could also be removed or made optional. It only affects core dump size, not allocation correctness. However, that's a behavioral change best evaluated separately.

…loops

The SYSCALL and PAS_SYSCALL macros retry syscalls on EAGAIN in a
zero-delay tight loop. When madvise(MADV_DONTDUMP) returns EAGAIN due
to kernel mmap_write_lock contention (VMA split/merge allocation
failure under memory pressure), this causes 100% CPU usage across
all GC threads — effectively freezing the application.

Add usleep(1000) backoff (1ms) and cap retries at 100 (100ms total).
madvise failures here are advisory, not fatal, so breaking after max
retries is safe. This matches the existing Windows precedent in
libpas/pas_page_malloc.c virtual_alloc_with_retry() which uses
Sleep(50ms) with 10 max retries.

Upstream Apple WebKit has the same zero-delay loop and has not yet
addressed this. tcmalloc uses bounded retries (3 attempts) for
expensive madvise operations. sched_yield() was considered but is
explicitly not recommended for this use case (Red Hat RHEL-RT guide).

Related: oven-sh/bun#17723, oven-sh/bun#27371, oven-sh/bun#27196,
google/tcmalloc#247, golang/go#61718
…k contention

MADV_DONTDUMP is the sole cause of the mmap_write_lock contention that
triggers the EAGAIN spin loop fixed in the previous commit. Unlike
MADV_DONTNEED which only acquires the kernel's mmap_read_lock (no
contention), MADV_DONTDUMP requires mmap_write_lock — a single
process-wide exclusive lock.

With concurrent GC threads all calling vmDeallocatePhysicalPages(),
MADV_DONTDUMP creates a serialization point in the kernel. Under
memory pressure, VMA split/merge allocation fails and the kernel
returns EAGAIN, which (before the previous fix) caused 100% CPU spin.

MADV_DONTDUMP only affects core dump size — it has zero impact on
memory reclamation or allocation correctness. MADV_DODUMP (its
symmetric counterpart in vmAllocatePhysicalPages/commit_impl) is
also removed.

This is the root cause elimination (vs the previous commit which
is the defensive mitigation). Together they fully resolve the issue.

Removed 4 madvise calls:
- VMAllocate.h vmDeallocatePhysicalPages: MADV_DONTDUMP
- VMAllocate.h vmAllocatePhysicalPages: MADV_DODUMP
- pas_page_malloc.c decommit_impl: MADV_DONTDUMP
- pas_page_malloc.c commit_impl: MADV_DODUMP