
bmalloc: add 1ms backoff and retry cap to SYSCALL/PAS_SYSCALL EAGAIN loops #169

Draft
coleleavitt wants to merge 2 commits into oven-sh:main from coleleavitt:fix/bmalloc-syscall-eagain-backoff

Conversation

@coleleavitt

Summary

The SYSCALL and PAS_SYSCALL macros in bmalloc retry syscalls on EAGAIN in a zero-delay tight loop. When madvise(MADV_DONTDUMP) returns EAGAIN due to kernel mmap_write_lock contention, this causes 100% CPU usage across all GC threads — effectively freezing the application.

This PR adds usleep(1000) backoff (1ms) and caps retries at 100 (100ms total).

Root Cause Analysis

LLM streaming / heavy allocation workload
  → JSC allocation-triggered GC fires
  → GC sweep calls bmalloc vmDeallocatePhysicalPages() thousands of times
  → Each call does TWO madvise syscalls: MADV_DONTNEED + MADV_DONTDUMP
  → MADV_DONTDUMP requires kernel mmap_write_lock (unlike MADV_DONTNEED which only needs read lock)
  → Multiple GC threads contend on single process-wide mmap_write_lock
  → Kernel returns EAGAIN (VMA split/merge allocation failure under memory pressure)
  → SYSCALL macro retries in ZERO-DELAY infinite loop: while((x)==-1 && errno==EAGAIN){}
  → 250K+ madvise calls/sec/thread, 100% CPU, application frozen

The Smoking Gun

BSyscall.h (before):

#define SYSCALL(x) do { \
    while ((x) == -1 && errno == EAGAIN) { } \
} while (0);

pas_utils.h (before):

#define PAS_SYSCALL(x) do { \
    while ((x) == -1 && errno == EAGAIN) { } \
} while (0)

Zero-delay infinite retry. No backoff, no sleep, no yield, no retry cap.

The Fix

// BSyscall.h
#define SYSCALL(x) do { \
    int _syscall_tries = 0; \
    while ((x) == -1 && errno == EAGAIN) { \
        if (++_syscall_tries > 100) break; \
        usleep(1000); \
    } \
} while (0);

// pas_utils.h
#define PAS_SYSCALL(x) do { \
    int _pas_syscall_tries = 0; \
    while ((x) == -1 && errno == EAGAIN) { \
        if (++_pas_syscall_tries > 100) break; \
        usleep(1000); \
    } \
} while (0)

Why This Approach

  • 1ms fixed delay is ~1000× longer than kernel lock hold time — more than enough for contention to clear
  • 100 retry cap (100ms total) prevents infinite loops under pathological conditions; madvise failures here are advisory, not fatal
  • Zero fast-path impact — the while body is dead code when the syscall succeeds
  • Matches the existing Windows precedent: virtual_alloc_with_retry() in libpas/pas_page_malloc.c already uses Sleep(50ms) with 10 max retries
  • Consistent with tcmalloc: Google's tcmalloc uses bounded retries (3 attempts) for expensive madvise operations

Why NOT sched_yield()

Per the Red Hat RHEL-RT Tuning Guide, sched_yield() may reschedule the caller immediately (degenerating into a busy loop) or only after a long delay; its behavior is unpredictable. usleep(1000) provides a deterministic 1ms backoff.

Blast Radius

17 callsites affected (all madvise/mprotect/mincore — all benefit from this fix):

  • 6 in bmalloc/VMAllocate.h (madvise calls in vmDeallocatePhysicalPages and vmAllocatePhysicalPages)
  • 9 in libpas/pas_page_malloc.c (madvise/mprotect in commit_impl and decommit_impl)
  • 2 in libpas/pas_committed_pages_vector.c (mincore in pas_committed_pages_vector_construct)

Upstream Status

Apple's upstream WebKit has the identical zero-delay SYSCALL macro and has not addressed this. This fix is novel.

Complementary Fix (Not in This PR)

MADV_DONTDUMP (the specific call that takes mmap_write_lock) could also be removed or made optional. It only affects core dump size, not allocation correctness. However, that's a behavioral change best evaluated separately.

…loops

The SYSCALL and PAS_SYSCALL macros retry syscalls on EAGAIN in a
zero-delay tight loop. When madvise(MADV_DONTDUMP) returns EAGAIN due
to kernel mmap_write_lock contention (VMA split/merge allocation
failure under memory pressure), this causes 100% CPU usage across
all GC threads — effectively freezing the application.

Add usleep(1000) backoff (1ms) and cap retries at 100 (100ms total).
madvise failures here are advisory, not fatal, so breaking after max
retries is safe. This matches the existing Windows precedent in
libpas/pas_page_malloc.c virtual_alloc_with_retry() which uses
Sleep(50ms) with 10 max retries.

Upstream Apple WebKit has the same zero-delay loop and has not yet
addressed this. tcmalloc uses bounded retries (3 attempts) for
expensive madvise operations. sched_yield() was considered but is
explicitly not recommended for this use case (Red Hat RHEL-RT guide).

Related: oven-sh/bun#17723, oven-sh/bun#27371, oven-sh/bun#27196,
google/tcmalloc#247, golang/go#61718
…k contention

MADV_DONTDUMP is the sole cause of the mmap_write_lock contention that
triggers the EAGAIN spin loop fixed in the previous commit. Unlike
MADV_DONTNEED which only acquires the kernel's mmap_read_lock (no
contention), MADV_DONTDUMP requires mmap_write_lock — a single
process-wide exclusive lock.

With concurrent GC threads all calling vmDeallocatePhysicalPages(),
MADV_DONTDUMP creates a serialization point in the kernel. Under
memory pressure, VMA split/merge allocation fails and the kernel
returns EAGAIN, which (before the previous fix) caused 100% CPU spin.

MADV_DONTDUMP only affects core dump size — it has zero impact on
memory reclamation or allocation correctness. MADV_DODUMP (its
symmetric counterpart in vmAllocatePhysicalPages/commit_impl) is
also removed.

This is the root cause elimination (vs the previous commit which
is the defensive mitigation). Together they fully resolve the issue.

Removed 4 madvise calls:
- VMAllocate.h vmDeallocatePhysicalPages: MADV_DONTDUMP
- VMAllocate.h vmAllocatePhysicalPages: MADV_DODUMP
- pas_page_malloc.c decommit_impl: MADV_DONTDUMP
- pas_page_malloc.c commit_impl: MADV_DODUMP