rpc: avoid ForkJoinPool compensation in send polling loop #7617
Merged
jkschneider merged 1 commit into main on May 9, 2026
The send polling loop called future.get(checkIntervalMs, TimeUnit.MILLISECONDS), which from a ForkJoinPool worker goes through CompletableFuture.Signaller.block, a ForkJoinPool.ManagedBlocker. ManagedBlocker is the explicit hook FJP watches for to spawn a compensation worker (a fresh thread added to keep parallelism while the original is parked).

The compensation worker can pick up other queued recipe-scheduler work and call PythonRewriteRpc.getOrStart() (e.g. from a printer or the lazy LazyRecipeBundleResolver supplier), spawning a fresh OS python rpc into its ThreadLocal. When FJP later terminates the idle compensation worker, its ThreadLocal (TL) is GC'd, but the OS python process is independent of the JVM heap and survives. RunTask.execute()'s finally only runs shutdownCurrent on the dispatching worker, never on the dead compensation worker. Each compensation-worker spawn leaked one OS process; long-running recipe-worker JVMs accumulated 100+ alive rpcs intra-run.

Replace future.get(timeout) with future.getNow(null) + Thread.sleep(1ms) polling. Thread.sleep blocks the thread directly rather than through ManagedBlocker, so FJP doesn't compensate. Parallelism temporarily drops by one until the rpc response arrives, which is exactly the resource-bounded behavior we want from --parallel. The liveness check is decoupled from the polling cadence and fires every 500ms to preserve the existing failure-detection cadence.

Pairs with #7616 (JVM-exit shutdown hook): #7616 catches survivors at JVM exit; this prevents the leak from happening intra-run.
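The polling pattern the commit message describes could look roughly like the sketch below. It is not the code from RewriteRpc.send: the class name PollingSend, the method awaitResponse, the processAlive supplier, the timeout parameter, and the exception types are all hypothetical and chosen for illustration. It also assumes a response is never null, since getNow(null) cannot otherwise distinguish "not done yet" from "completed with null".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical illustration; names and exception types are not from the PR.
final class PollingSend {
    static final long POLL_INTERVAL_MS = 1;        // sleep between getNow() probes
    static final long LIVENESS_INTERVAL_MS = 500;  // how often to check the child process

    // Waits for a non-null response without calling future.get(timeout),
    // so a ForkJoinPool worker never enters a ManagedBlocker here.
    static <T> T awaitResponse(CompletableFuture<T> future,
                               BooleanSupplier processAlive,
                               long timeoutMs)
            throws InterruptedException, TimeoutException {
        long start = System.nanoTime();
        long lastLivenessCheck = start;
        long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        long livenessNanos = TimeUnit.MILLISECONDS.toNanos(LIVENESS_INTERVAL_MS);

        while (true) {
            // Non-blocking probe; rethrows if the future completed exceptionally.
            T result = future.getNow(null);
            if (result != null) {
                return result;
            }
            long now = System.nanoTime();
            if (now - start >= timeoutNanos) {
                throw new TimeoutException("no rpc response within " + timeoutMs + "ms");
            }
            // Liveness is checked on its own cadence, independent of the 1ms poll.
            if (now - lastLivenessCheck >= livenessNanos) {
                if (!processAlive.getAsBoolean()) {
                    throw new IllegalStateException("rpc process exited before responding");
                }
                lastLivenessCheck = now;
            }
            Thread.sleep(POLL_INTERVAL_MS);
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> response = new CompletableFuture<>();
        CompletableFuture.runAsync(() -> response.complete("pong"));
        System.out.println(awaitResponse(response, () -> true, 5_000));
    }
}
```

The trade-off is the one the commit message names: the calling worker stays occupied while it polls, so effective parallelism drops by one for the duration of the call instead of the pool adding a replacement thread.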
What's changed?
RewriteRpc.send's polling loop now uses future.getNow(null) + Thread.sleep(1ms) instead of future.get(checkIntervalMs, TimeUnit.MILLISECONDS). The liveness check is decoupled from the polling cadence and fires every 500ms.

What's your motivation?
future.get(long, TimeUnit) from a ForkJoinPool worker goes through CompletableFuture.Signaller.block, which is a ForkJoinPool.ManagedBlocker. ManagedBlocker is the explicit hook FJP watches for and reacts to by spawning a compensation worker: a fresh thread added to keep parallelism while the original is parked.

That compensation worker can pick up other queued work (recipe load, printer invocations, etc.) and call PythonRewriteRpc.getOrStart() / JavaScriptRewriteRpc.getOrStart() / etc., spawning a fresh OS rpc subprocess into its ThreadLocal. When FJP later terminates the idle compensation worker, its ThreadLocal (TL) is GC'd, but the OS process is independent of the JVM heap and survives. The dispatching worker's RunTask.execute() finally only runs shutdownCurrent() on the dispatching thread's TL, never on the dead compensation worker.

On a moderne-cli mod run against ~448 Python repos, this accumulated 127+ alive python rpc processes per long-running JVM, with each leaked rpc carrying a different past repo's log path in argv. Each compensation-worker spawn = one leaked OS process. The same mechanism applies to JS / C# / Go rpcs.

Thread.sleep blocks the thread directly rather than through ManagedBlocker, so FJP doesn't compensate. Parallelism temporarily drops by one while a worker is in rpc.send, which is exactly the resource-bounded behavior --parallel is supposed to mean.

Verified locally with a 20-repo Python LST sample running UpgradeToPython314 at --parallel=14:
Before: spawn count >> repo count under load (compensation amplification)
After: spawn count = repo count, alive rpc count ≤ --parallel, no compensation worker thread names in jstack, final_alive=0 after run.

Pairs with rpc: kill child processes at JVM exit via shutdown hook #7616
rpc: kill child processes at JVM exit via shutdown hook #7616 added a JVM shutdown hook on RewriteRpcProcess to kill child processes at JVM exit, which catches survivors at process termination. This PR prevents the leak from happening intra-run by removing its trigger.

Checklist