Auto-start CPU sandboxes for sessions by lewtun · Pull Request #200 · huggingface/ml-intern

lewtun · 2026-05-01T13:25:59Z

Summary

This PR starts a private cpu-basic sandbox automatically for each agent session so normal CPU code execution can use bash, read, write, and edit without requiring a manual sandbox approval first.

It keeps approval in place for GPU and other non-default sandbox hardware, and it adds best-effort sandbox teardown on browser/tab close without deleting durable chat history.

Changes

Schedule a background cpu-basic sandbox preload for newly created and restored sessions.
Let sandbox operation tools wait for an in-flight preload and surface preload failures only when execution is attempted.
Keep sandbox_create approval-free for the default CPU sandbox while preserving manual approval for GPU tiers and cpu-upgrade.
Allow approved GPU/non-default sandbox creation to replace the auto-started CPU sandbox.
Add POST /api/session/{session_id}/sandbox/teardown and a frontend pagehide hook for best-effort browser-close cleanup.
Update the system prompt and sandbox tool descriptions so the agent knows CPU sandbox execution is already available.

Verification

uv run --extra dev pytest tests/unit/test_sandbox_auto_start.py tests/unit/test_sandbox_private_spaces.py tests/unit/test_sandbox_already_active_message.py tests/unit/test_session_manager_persistence.py
npm run build
npx eslint src/components/Layout/AppLayout.tsx

Full npm run lint still reports existing unrelated lint issues in ActivityStatusBar.tsx, agentStore.ts, main.tsx, and logger.ts.

Co-authored-by: Codex <codex@openai.com>

github-actions · 2026-05-01T13:26:14Z

Claude finished @lewtun's task in 5m 57s —— View job

PR Review: Auto-start CPU sandboxes for sessions

0 P0, 3 P1
Verdict: ready to merge

No blocking issues. The core mechanics — preload lifecycle, cancel-on-GPU-upgrade, auth on the teardown route, and the cancel_flag race analysis — are all correct. Three P1s worth a follow-up.

What I checked

_needs_approval logic in agent_loop.py — hardware key lookup, approval bypass for cpu-basic
start_cpu_sandbox_preload / get_active_or_preloaded_sandbox / cancel_sandbox_preload — full lifecycle including race between preload completion and GPU upgrade request
sandbox_create_handler — CPU-sandbox teardown before GPU creation, all branch paths traced
teardown_session_sandbox in sandbox_tool.py — cancel_event propagation, session.sandbox = None idempotency
teardown_sandbox route in backend/routes/agent.py — auth chain, fire-and-forget task
AppLayout.tsx pagehide hook — sendBeacon credential behavior (auth uses hf_access_token cookie, which sendBeacon includes for same-origin requests — safe)
Unit tests — test_sandbox_auto_start.py, test_sandbox_private_spaces.py, test_session_manager_persistence.py

P1 — Fire-and-forget task can be silently GC'd

backend/routes/agent.py:547

asyncio.create_task(session_manager.teardown_sandbox(session_id))

Python's asyncio docs warn explicitly: "Save a reference to the result of this function to avoid a task disappearing mid-execution. The event loop only keeps weak references to tasks." If the GC runs before the teardown coroutine finishes, the task is collected and the sandbox is not deleted — exactly what the sweep script is designed to catch, but this makes every pagehide a potential orphan. Standard fix:

task = asyncio.create_task(session_manager.teardown_sandbox(session_id))
task.add_done_callback(lambda _: None)  # keep strong ref alive until done

Or keep a module-level set of background tasks and discard on completion.

Fix this →

P1 — `cancel_sandbox_preload` leaks the asyncio task after timeout

agent/tools/sandbox_tool.py:355-358

await asyncio.wait_for(asyncio.shield(task), timeout=30)

asyncio.shield prevents cancellation from propagating into task. When wait_for times out after 30 s, it cancels the shield wrapper, logs a warning, and returns — but the underlying preload task continues running indefinitely. The cancel_event (threading.Event) was already set, so the thread should stop eventually and session.sandbox won't be overwritten, but the asyncio task object itself stays live until the thread finishes, with no further tracking. For a best-effort operation this is acceptable, but the logged warning should note the task is still live so operators aren't confused. Consider calling task.cancel() after the timeout as well, so the coroutine wrapper also terminates promptly once the thread has stopped.

Fix this →

P1 — Session restore path not covered by preload tests

backend/session_manager.py:499 calls _start_cpu_sandbox_preload on the lazy-restore path inside ensure_session_loaded. The new test (test_create_session_schedules_cpu_sandbox_preload) only monkey-patches the create_session path. A restored session that returns via the started is not agent_session branch (line 492) skips _start_cpu_sandbox_preload entirely — which is intentional, but not explicitly tested. A complementary test for the ensure_session_loaded restore path would confirm the preload fires (and only once) when an inactive session is re-connected.

P2 — Cargo-culted hardware key aliases in `_needs_approval`

agent/core/agent_loop.py:129-132

hardware = (
    tool_args.get("hardware")
    or tool_args.get("hardware_flavor")   # ← not in sandbox_create schema
    or tool_args.get("flavor")            # ← not in sandbox_create schema
    or DEFAULT_CPU_SANDBOX_HARDWARE
)

sandbox_create only exposes hardware. The hardware_flavor / flavor aliases are copy-pasted from the hf_jobs check directly below. They're harmless (missing keys return None) but confusing. Removing them would make the intent clearer.

Co-authored-by: Codex <codex@openai.com>

github-actions · 2026-05-01T13:37:43Z

Claude finished @lewtun's task in 1m 38s —— View job

PR Review: Auto-start CPU sandboxes for sessions (re-review on `e19c139`)

0 P0
Verdict: ready to merge

This re-review covers only the new commit (e19c139 Retry sandbox runtime visibility checks). Per review policy, previously flagged P1s are not re-posted.

What I checked

_is_transient_space_visibility_error helper — response attribute lookup and string-match fallback, no injection surface
Sandbox.create wait loop at sandbox_client.py:643-667 — deadline bound (WAIT_TIMEOUT = 600), _check_cancel() still called at loop top, no infinite-retry risk
Test test_sandbox_client_retries_transient_runtime_404 in test_sandbox_private_spaces.py:40 — raises 404 on first call, returns RUNNING on second, verifies runtime_calls == 2; adequate coverage for the new path
Previously flagged P1s — backend/routes/agent.py:547 fire-and-forget task and cancel_sandbox_preload timeout task leak are not fixed but were already flagged; not re-posted

Status of previously flagged P1s

The three P1s from the prior review were not addressed in this commit. They stand as informational items for a follow-up; they are not re-flagged here.

Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-05-01T14:18:58Z

Claude finished @lewtun's task in 3m 24s —— View job

PR Review: Auto-start CPU sandboxes for sessions (re-review on `683102a`)

Gather context and read changed files
Analyze latest commit for P0 issues
Post final review

0 P0
Verdict: ready to merge

This re-review covers only the new commit (683102a Stabilize auto CPU sandbox creation). Per review policy, previously flagged P1s are not re-posted.

What I checked

request_space_hardware call added at sandbox_client.py:630 — correct placement (after duplicate_space, before secrets injection); sleep_time threading is consistent with the duplicate kwargs
Hardware-mismatch wait loop at sandbox_client.py:663–672 — loops while runtime.hardware != hardware when RUNNING; SpaceHardware is str, Enum so string comparison with "cpu-basic" is correct; None hardware causes additional waiting but is bounded by WAIT_TIMEOUT=600
_get_sandbox_create_lock at sandbox_tool.py:50–57 — asyncio single-thread model makes the dict mutation safe; WeakKeyDictionary keyed on the event loop prevents cross-loop lock sharing
Double-checked locking in _ensure_sandbox at sandbox_tool.py:223–225 — correct: inner session.sandbox check re-runs under the lock; two sessions for same owner each get their own sandbox (second session's session.sandbox is still None, proceeds correctly)
_create_sandbox_locked cancel-after-success path at sandbox_tool.py:305–311 — sb is always bound if this block is reached (non-Cancelled exception would propagate before reaching the check); cleanup and early return are correct
Orphan cleanup removal from creation path — intentional; _cleanup_user_orphan_sandboxes is still defined but no longer called at creation time; the monkeypatch in test_sandbox_creation_is_serialized_per_owner is a no-op
list_spaces(full=True) fix at sandbox_tool.py:147 — required to populate lastModified; last_mod is None guard at line 163 prevents deleting spaces with missing timestamps
New tests — test_orphan_sweep_preserves_spaces_without_last_modified and test_sandbox_creation_is_serialized_per_owner cover both new behaviors; max_active_creates == 1 assertion correctly verifies serialization

Status of previously flagged P1s

The three P1s from the prior review (agent.py:547 fire-and-forget GC, cancel_sandbox_preload task leak on timeout, session restore preload test coverage) were not addressed in this commit. They stand as informational items for a follow-up; not re-flagged here.

github-actions · 2026-05-01T17:06:25Z

Claude finished @lewtun's task in 3m 46s —— View job

PR Review: Auto-start CPU sandboxes for sessions (re-review on `5543a34`)

Gather context and read changed files
Analyze latest commits for P0 issues
Post final review

0 P0
Verdict: ready to merge

This re-review covers only the new commit (5543a34 Merge branch 'main' into codex/auto-start-cpu-sandbox), which pulls the YOLO auto-approval budget feature from main. Per review policy, previously flagged P1s are not re-posted.

What I checked

YOLO × cpu-basic sandbox interaction (agent_loop.py:144-165, cost_estimation.py:46-47): _is_budgeted_auto_approval_target returns True for all sandbox_create calls, but SPACE_PRICE_USD_PER_HOUR["cpu-basic"] = 0.0 makes billable=False and estimated_cost_usd=0.0. The budget path is entered but never charges the cap; no regression.
YOLO × GPU sandbox upgrade path: _base_needs_approval returns True for non-cpu hardware; _approval_decision correctly routes through the budget check before auto-approving; sandbox_create_handler:594-598 tears down the CPU sandbox before GPU creation — sequence is correct.
Auto-preload bypasses approval entirely: start_cpu_sandbox_preload calls _ensure_sandbox directly, so no _approval_decision call and no budget deduction for the implicit CPU sandbox. Correct.
PATCH /session/{session_id}/yolo auth chain (routes/agent.py:502-522): _check_session_access + Depends(get_current_user) guard is consistent with other write routes. No auth gap.
update_session_auto_approval (session_manager.py:933-968): cap defaults (DEFAULT_YOLO_COST_CAP_USD = 5.0) applied correctly; persist_session_snapshot called after mutation — no state-loss path.
Merge resolution of session.py: preload attributes (sandbox_hardware, sandbox_preload_task, etc.) and YOLO attributes (auto_approval_enabled, auto_approval_cost_cap_usd) coexist correctly at lines 119–129; no collision.

Status of previously flagged P1s

The three P1s from the prior reviews (agent.py:568 fire-and-forget GC, cancel_sandbox_preload task leak on timeout, session restore preload test coverage) were not addressed in this commit. They stand as informational items for a follow-up; not re-flagged here.

Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-05-01T17:52:49Z

Claude finished @lewtun's task in 2m 12s —— View job

PR Review: Auto-start CPU sandboxes for sessions (re-review on `06a4326`)

Gather context and read changed files
Analyze latest commit for previously flagged P1s and new issues
Post final review

0 P0
Verdict: ready to merge

This re-review covers only the new commit (06a4326 Address sandbox PR review comments). All three P1s flagged in prior reviews are now fixed.

What I checked

P1 fix — fire-and-forget GC (backend/routes/agent.py:569–571): module-level _background_teardown_tasks: set[asyncio.Task] added; task is inserted before the done callback fires and removed via discard on completion. The ordering is safe under asyncio's single-threaded model — the scheduled coroutine won't execute until the current frame yields.
P1 fix — cancel_sandbox_preload task leak (sandbox_tool.py:400–405): task.cancel() is now called after the 30-second timeout and the log message correctly states the task was still live. Traced through the test (test_cancel_sandbox_preload_cancels_task_after_timeout): fake_wait_for raises TimeoutError, task.cancel() is called, await asyncio.sleep(0) lets the loop process the cancellation; task.cancelled() and cancel_event.is_set() assertions both pass.
P1 fix — session restore path coverage (test_session_manager_persistence.py:234–258): test_lazy_restore_schedules_cpu_sandbox_preload patches _start_cpu_sandbox_preload and asserts it fires for the restored session. test_concurrent_lazy_restore_starts_only_one_agent_task was also updated to assert scheduled == ["persisted-session"], confirming the preload fires exactly once even under concurrent restores.
P2 fix — hardware alias cleanup (agent/core/agent_loop.py:159): hardware_flavor and flavor aliases removed; now just tool_args.get("hardware") or DEFAULT_CPU_SANDBOX_HARDWARE. Clean.
No new code paths introduced that could carry new P0 risk.

Status of previously flagged P1s

All three prior P1s are resolved. No outstanding informational items.

lewtun · 2026-05-01T18:34:11Z


-  For non-trivial scripts, develop and test in a sandbox before launching via hf_jobs:
-    sandbox_create → install deps → write script → test with small run → fix errors → launch via hf_jobs at scale
+  A private cpu-basic sandbox is already available for normal code execution in each session. For non-trivial scripts, develop and test there before launching via hf_jobs:


FYI @akseljoonas

Auto-start CPU sandboxes for sessions

05fde84

Co-authored-by: Codex <codex@openai.com>

Retry sandbox runtime visibility checks

e19c139

Co-authored-by: Codex <codex@openai.com>

Stabilize auto CPU sandbox creation

683102a

Co-authored-by: OpenAI Codex <codex@openai.com>

Merge branch 'main' into codex/auto-start-cpu-sandbox

5543a34

Address sandbox PR review comments

06a4326

Co-authored-by: OpenAI Codex <codex@openai.com>

lewtun commented May 1, 2026

View reviewed changes

lewtun merged commit 8615c28 into main May 1, 2026
3 checks passed

lewtun deleted the codex/auto-start-cpu-sandbox branch May 1, 2026 18:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto-start CPU sandboxes for sessions#200

Auto-start CPU sandboxes for sessions#200
lewtun merged 5 commits into
mainfrom
codex/auto-start-cpu-sandbox

lewtun commented May 1, 2026

Uh oh!

github-actions Bot commented May 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 1, 2026 •

edited

Loading

Uh oh!

lewtun May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lewtun commented May 1, 2026

Summary

Changes

Verification

Uh oh!

github-actions Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Auto-start CPU sandboxes for sessions

What I checked

P1 — Fire-and-forget task can be silently GC'd

P1 — cancel_sandbox_preload leaks the asyncio task after timeout

P1 — Session restore path not covered by preload tests

P2 — Cargo-culted hardware key aliases in _needs_approval

Uh oh!

github-actions Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Auto-start CPU sandboxes for sessions (re-review on e19c139)

What I checked

Status of previously flagged P1s

Uh oh!

github-actions Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Auto-start CPU sandboxes for sessions (re-review on 683102a)

What I checked

Status of previously flagged P1s

Uh oh!

github-actions Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Auto-start CPU sandboxes for sessions (re-review on 5543a34)

What I checked

Status of previously flagged P1s

Uh oh!

github-actions Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Auto-start CPU sandboxes for sessions (re-review on 06a4326)

What I checked

Status of previously flagged P1s

Uh oh!

lewtun May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 1, 2026 •

edited

Loading

P1 — `cancel_sandbox_preload` leaks the asyncio task after timeout

P2 — Cargo-culted hardware key aliases in `_needs_approval`

github-actions Bot commented May 1, 2026 •

edited

Loading

PR Review: Auto-start CPU sandboxes for sessions (re-review on `e19c139`)

github-actions Bot commented May 1, 2026 •

edited

Loading

PR Review: Auto-start CPU sandboxes for sessions (re-review on `683102a`)

github-actions Bot commented May 1, 2026 •

edited

Loading

PR Review: Auto-start CPU sandboxes for sessions (re-review on `5543a34`)

github-actions Bot commented May 1, 2026 •

edited

Loading

PR Review: Auto-start CPU sandboxes for sessions (re-review on `06a4326`)