Skip to content

Auto-start CPU sandboxes for sessions#200

Merged
lewtun merged 5 commits into
mainfrom
codex/auto-start-cpu-sandbox
May 1, 2026
Merged

Auto-start CPU sandboxes for sessions#200
lewtun merged 5 commits into
mainfrom
codex/auto-start-cpu-sandbox

Conversation

@lewtun
Copy link
Copy Markdown
Member

@lewtun lewtun commented May 1, 2026

Summary

This PR starts a private cpu-basic sandbox automatically for each agent session so normal CPU code execution can use bash, read, write, and edit without requiring a manual sandbox approval first.

It keeps approval in place for GPU and other non-default sandbox hardware, and it adds best-effort sandbox teardown on browser/tab close without deleting durable chat history.

Changes

  • Schedule a background cpu-basic sandbox preload for newly created and restored sessions.
  • Let sandbox operation tools wait for an in-flight preload and surface preload failures only when execution is attempted.
  • Keep sandbox_create approval-free for the default CPU sandbox while preserving manual approval for GPU tiers and cpu-upgrade.
  • Allow approved GPU/non-default sandbox creation to replace the auto-started CPU sandbox.
  • Add POST /api/session/{session_id}/sandbox/teardown and a frontend pagehide hook for best-effort browser-close cleanup.
  • Update the system prompt and sandbox tool descriptions so the agent knows CPU sandbox execution is already available.

Verification

  • uv run --extra dev pytest tests/unit/test_sandbox_auto_start.py tests/unit/test_sandbox_private_spaces.py tests/unit/test_sandbox_already_active_message.py tests/unit/test_session_manager_persistence.py
  • npm run build
  • npx eslint src/components/Layout/AppLayout.tsx

Full npm run lint still reports existing unrelated lint issues in ActivityStatusBar.tsx, agentStore.ts, main.tsx, and logger.ts.

Co-authored-by: Codex <codex@openai.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Claude finished @lewtun's task in 5m 57s —— View job


PR Review: Auto-start CPU sandboxes for sessions

0 P0, 3 P1
Verdict: ready to merge

No blocking issues. The core mechanics — preload lifecycle, cancel-on-GPU-upgrade, auth on the teardown route, and the cancel_flag race analysis — are all correct. Three P1s worth a follow-up.


What I checked

  • _needs_approval logic in agent_loop.py — hardware key lookup, approval bypass for cpu-basic
  • start_cpu_sandbox_preload / get_active_or_preloaded_sandbox / cancel_sandbox_preload — full lifecycle including race between preload completion and GPU upgrade request
  • sandbox_create_handler — CPU-sandbox teardown before GPU creation, all branch paths traced
  • teardown_session_sandbox in sandbox_tool.py — cancel_event propagation, session.sandbox = None idempotency
  • teardown_sandbox route in backend/routes/agent.py — auth chain, fire-and-forget task
  • AppLayout.tsx pagehide hook — sendBeacon credential behavior (auth uses hf_access_token cookie, which sendBeacon includes for same-origin requests — safe)
  • Unit tests — test_sandbox_auto_start.py, test_sandbox_private_spaces.py, test_session_manager_persistence.py

P1 — Fire-and-forget task can be silently GC'd

backend/routes/agent.py:547

asyncio.create_task(session_manager.teardown_sandbox(session_id))

Python's asyncio docs warn explicitly: "Save a reference to the result of this function to avoid a task disappearing mid-execution. The event loop only keeps weak references to tasks." If the GC runs before the teardown coroutine finishes, the task is collected and the sandbox is not deleted — exactly what the sweep script is designed to catch, but this makes every pagehide a potential orphan. Standard fix:

task = asyncio.create_task(session_manager.teardown_sandbox(session_id))
task.add_done_callback(lambda _: None)  # keep strong ref alive until done

Or keep a module-level set of background tasks and discard on completion.

Fix this →


P1 — cancel_sandbox_preload leaks the asyncio task after timeout

agent/tools/sandbox_tool.py:355-358

await asyncio.wait_for(asyncio.shield(task), timeout=30)

asyncio.shield prevents cancellation from propagating into task. When wait_for times out after 30 s, it cancels the shield wrapper, logs a warning, and returns — but the underlying preload task continues running indefinitely. The cancel_event (threading.Event) was already set, so the thread should stop eventually and session.sandbox won't be overwritten, but the asyncio task object itself stays live until the thread finishes, with no further tracking. For a best-effort operation this is acceptable, but the logged warning should note the task is still live so operators aren't confused. Consider calling task.cancel() after the timeout as well, so the coroutine wrapper also terminates promptly once the thread has stopped.

Fix this →


P1 — Session restore path not covered by preload tests

backend/session_manager.py:499 calls _start_cpu_sandbox_preload on the lazy-restore path inside ensure_session_loaded. The new test (test_create_session_schedules_cpu_sandbox_preload) only monkey-patches the create_session path. A restored session that returns via the started is not agent_session branch (line 492) skips _start_cpu_sandbox_preload entirely — which is intentional, but not explicitly tested. A complementary test for the ensure_session_loaded restore path would confirm the preload fires (and only once) when an inactive session is re-connected.


P2 — Cargo-culted hardware key aliases in _needs_approval

agent/core/agent_loop.py:129-132

hardware = (
    tool_args.get("hardware")
    or tool_args.get("hardware_flavor")   # ← not in sandbox_create schema
    or tool_args.get("flavor")            # ← not in sandbox_create schema
    or DEFAULT_CPU_SANDBOX_HARDWARE
)

sandbox_create only exposes hardware. The hardware_flavor / flavor aliases are copy-pasted from the hf_jobs check directly below. They're harmless (missing keys return None) but confusing. Removing them would make the intent clearer.


Co-authored-by: Codex <codex@openai.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Claude finished @lewtun's task in 1m 38s —— View job


PR Review: Auto-start CPU sandboxes for sessions (re-review on e19c139)

0 P0
Verdict: ready to merge

This re-review covers only the new commit (e19c139 Retry sandbox runtime visibility checks). Per review policy, previously flagged P1s are not re-posted.


What I checked

  • _is_transient_space_visibility_error helper — response attribute lookup and string-match fallback, no injection surface
  • Sandbox.create wait loop at sandbox_client.py:643-667 — deadline bound (WAIT_TIMEOUT = 600), _check_cancel() still called at loop top, no infinite-retry risk
  • Test test_sandbox_client_retries_transient_runtime_404 in test_sandbox_private_spaces.py:40 — raises 404 on first call, returns RUNNING on second, verifies runtime_calls == 2; adequate coverage for the new path
  • Previously flagged P1s — backend/routes/agent.py:547 fire-and-forget task and cancel_sandbox_preload timeout task leak are not fixed but were already flagged; not re-posted

Status of previously flagged P1s

The three P1s from the prior review were not addressed in this commit. They stand as informational items for a follow-up; they are not re-flagged here.

Co-authored-by: OpenAI Codex <codex@openai.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Claude finished @lewtun's task in 3m 24s —— View job


PR Review: Auto-start CPU sandboxes for sessions (re-review on 683102a)

  • Gather context and read changed files
  • Analyze latest commit for P0 issues
  • Post final review

0 P0
Verdict: ready to merge

This re-review covers only the new commit (683102a Stabilize auto CPU sandbox creation). Per review policy, previously flagged P1s are not re-posted.


What I checked

  • request_space_hardware call added at sandbox_client.py:630 — correct placement (after duplicate_space, before secrets injection); sleep_time threading is consistent with the duplicate kwargs
  • Hardware-mismatch wait loop at sandbox_client.py:663–672 — loops while runtime.hardware != hardware when RUNNING; SpaceHardware is str, Enum so string comparison with "cpu-basic" is correct; None hardware causes additional waiting but is bounded by WAIT_TIMEOUT=600
  • _get_sandbox_create_lock at sandbox_tool.py:50–57 — asyncio single-thread model makes the dict mutation safe; WeakKeyDictionary keyed on the event loop prevents cross-loop lock sharing
  • Double-checked locking in _ensure_sandbox at sandbox_tool.py:223–225 — correct: inner session.sandbox check re-runs under the lock; two sessions for same owner each get their own sandbox (second session's session.sandbox is still None, proceeds correctly)
  • _create_sandbox_locked cancel-after-success path at sandbox_tool.py:305–311sb is always bound if this block is reached (non-Cancelled exception would propagate before reaching the check); cleanup and early return are correct
  • Orphan cleanup removal from creation path — intentional; _cleanup_user_orphan_sandboxes is still defined but no longer called at creation time; the monkeypatch in test_sandbox_creation_is_serialized_per_owner is a no-op
  • list_spaces(full=True) fix at sandbox_tool.py:147 — required to populate lastModified; last_mod is None guard at line 163 prevents deleting spaces with missing timestamps
  • New tests — test_orphan_sweep_preserves_spaces_without_last_modified and test_sandbox_creation_is_serialized_per_owner cover both new behaviors; max_active_creates == 1 assertion correctly verifies serialization

Status of previously flagged P1s

The three P1s from the prior review (agent.py:547 fire-and-forget GC, cancel_sandbox_preload task leak on timeout, session restore preload test coverage) were not addressed in this commit. They stand as informational items for a follow-up; not re-flagged here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Claude finished @lewtun's task in 3m 46s —— View job


PR Review: Auto-start CPU sandboxes for sessions (re-review on 5543a34)

  • Gather context and read changed files
  • Analyze latest commits for P0 issues
  • Post final review

0 P0
Verdict: ready to merge

This re-review covers only the new commit (5543a34 Merge branch 'main' into codex/auto-start-cpu-sandbox), which pulls the YOLO auto-approval budget feature from main. Per review policy, previously flagged P1s are not re-posted.


What I checked

  • YOLO × cpu-basic sandbox interaction (agent_loop.py:144-165, cost_estimation.py:46-47): _is_budgeted_auto_approval_target returns True for all sandbox_create calls, but SPACE_PRICE_USD_PER_HOUR["cpu-basic"] = 0.0 makes billable=False and estimated_cost_usd=0.0. The budget path is entered but never charges the cap; no regression.
  • YOLO × GPU sandbox upgrade path: _base_needs_approval returns True for non-cpu hardware; _approval_decision correctly routes through the budget check before auto-approving; sandbox_create_handler:594-598 tears down the CPU sandbox before GPU creation — sequence is correct.
  • Auto-preload bypasses approval entirely: start_cpu_sandbox_preload calls _ensure_sandbox directly, so no _approval_decision call and no budget deduction for the implicit CPU sandbox. Correct.
  • PATCH /session/{session_id}/yolo auth chain (routes/agent.py:502-522): _check_session_access + Depends(get_current_user) guard is consistent with other write routes. No auth gap.
  • update_session_auto_approval (session_manager.py:933-968): cap defaults (DEFAULT_YOLO_COST_CAP_USD = 5.0) applied correctly; persist_session_snapshot called after mutation — no state-loss path.
  • Merge resolution of session.py: preload attributes (sandbox_hardware, sandbox_preload_task, etc.) and YOLO attributes (auto_approval_enabled, auto_approval_cost_cap_usd) coexist correctly at lines 119–129; no collision.

Status of previously flagged P1s

The three P1s from the prior reviews (agent.py:568 fire-and-forget GC, cancel_sandbox_preload task leak on timeout, session restore preload test coverage) were not addressed in this commit. They stand as informational items for a follow-up; not re-flagged here.

Co-authored-by: OpenAI Codex <codex@openai.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Claude finished @lewtun's task in 2m 12s —— View job


PR Review: Auto-start CPU sandboxes for sessions (re-review on 06a4326)

  • Gather context and read changed files
  • Analyze latest commit for previously flagged P1s and new issues
  • Post final review

0 P0
Verdict: ready to merge

This re-review covers only the new commit (06a4326 Address sandbox PR review comments). All three P1s flagged in prior reviews are now fixed.


What I checked

  • P1 fix — fire-and-forget GC (backend/routes/agent.py:569–571): module-level _background_teardown_tasks: set[asyncio.Task] added; task is inserted before the done callback fires and removed via discard on completion. The ordering is safe under asyncio's single-threaded model — the scheduled coroutine won't execute until the current frame yields.
  • P1 fix — cancel_sandbox_preload task leak (sandbox_tool.py:400–405): task.cancel() is now called after the 30-second timeout and the log message correctly states the task was still live. Traced through the test (test_cancel_sandbox_preload_cancels_task_after_timeout): fake_wait_for raises TimeoutError, task.cancel() is called, await asyncio.sleep(0) lets the loop process the cancellation; task.cancelled() and cancel_event.is_set() assertions both pass.
  • P1 fix — session restore path coverage (test_session_manager_persistence.py:234–258): test_lazy_restore_schedules_cpu_sandbox_preload patches _start_cpu_sandbox_preload and asserts it fires for the restored session. test_concurrent_lazy_restore_starts_only_one_agent_task was also updated to assert scheduled == ["persisted-session"], confirming the preload fires exactly once even under concurrent restores.
  • P2 fix — hardware alias cleanup (agent/core/agent_loop.py:159): hardware_flavor and flavor aliases removed; now just tool_args.get("hardware") or DEFAULT_CPU_SANDBOX_HARDWARE. Clean.
  • No new code paths introduced that could carry new P0 risk.

Status of previously flagged P1s

All three prior P1s are resolved. No outstanding informational items.


For non-trivial scripts, develop and test in a sandbox before launching via hf_jobs:
sandbox_create → install deps → write script → test with small run → fix errors → launch via hf_jobs at scale
A private cpu-basic sandbox is already available for normal code execution in each session. For non-trivial scripts, develop and test there before launching via hf_jobs:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lewtun lewtun merged commit 8615c28 into main May 1, 2026
3 checks passed
@lewtun lewtun deleted the codex/auto-start-cpu-sandbox branch May 1, 2026 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant