feat(eval): add Modal and Daytona runtime providers for per-rollout cloud sandboxes #423

Merged
jdchawla29 merged 13 commits into v6 from lukass/modal-daytona-runtimes on Jun 18, 2026

lukass16 commented Jun 17, 2026

Issue

The engine could place rollouts locally, in Docker, on a borrowed substrate, or
HUD-hosted — but not on on-demand cloud sandboxes. We want isolated, parallel
cloud envs (Modal, Daytona) per rollout.

Solution

Two new Providers in hud/eval/runtime.py, same shape as DockerRuntime
(acquire → yield Runtime → tear down), so rollout()/connect()/scheduler are
unchanged:

  • ModalRuntime: Sandbox.create per rollout from a pre-built image, control
    channel over raw TCP (unencrypted_ports), terminate on exit. The image
    resolves once behind a lock (from_name, or lazy image= build) so concurrent
    rollouts can't race a build.
  • DaytonaRuntime: sandbox from a snapshot (built once from image= if
    missing), env server in a background session, reached via an asyncssh
    local-forward (Daytona exposes only HTTPS previews; connect() dials tcp://).
    The SSH token is internal; users only need DAYTONA_API_KEY. workdir defaults
    to /app (scaffold WORKDIR).
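
For readers unfamiliar with the provider shape, here is a minimal sketch of the acquire → yield Runtime → tear down lifecycle combined with the resolve-once lock. It is not the code in hud/eval/runtime.py: `SketchCloudRuntime`, the `Runtime` dataclass, the `builds`/`live` counters, and the example URL are all hypothetical stand-ins, and the sleep stands in for a real image build or lookup.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Runtime:
    """Hypothetical handle that a rollout would connect to."""
    url: str


class SketchCloudRuntime:
    """Illustrative provider only; names and fields are assumptions."""

    def __init__(self, image: str) -> None:
        self.image = image
        self._resolved: Optional[str] = None
        self._lock = asyncio.Lock()
        self.builds = 0          # how many times the image was resolved
        self.live = set()        # sandboxes that are currently running
        self._next_id = 0

    async def _resolve_image(self) -> str:
        # Check the cache under the lock so only the first caller builds.
        async with self._lock:
            if self._resolved is None:
                await asyncio.sleep(0.01)  # stand-in for a slow build/lookup
                self.builds += 1
                self._resolved = f"resolved:{self.image}"
            return self._resolved

    @asynccontextmanager
    async def acquire(self):
        await self._resolve_image()
        self._next_id += 1
        sandbox_id = f"sb-{self._next_id}"
        self.live.add(sandbox_id)          # stand-in for creating a sandbox
        try:
            yield Runtime(url=f"tcp://{sandbox_id}.example.invalid:8765")
        finally:
            self.live.discard(sandbox_id)  # tear down even if the rollout fails


async def run_rollouts(n: int) -> SketchCloudRuntime:
    provider = SketchCloudRuntime(image="my-env")

    async def rollout() -> None:
        async with provider.acquire() as runtime:
            assert runtime.url.startswith("tcp://")
            await asyncio.sleep(0)  # stand-in for the actual rollout

    await asyncio.gather(*(rollout() for _ in range(n)))
    return provider
```

Running `asyncio.run(run_rollouts(5))` in this sketch resolves the image once and leaves no sandbox behind, which is the property the lock and the `finally` block are there to guarantee.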

Single user handle is the image/snapshot name. Both exported from hud.eval, gated
behind optional [modal]/[daytona] extras. Adds modal_deploy.py to build+publish
the libero image.
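
One common way to gate a provider behind an optional extra is to import the SDK lazily and fail with an install hint. The sketch below illustrates that pattern only; `require_extra` is a hypothetical helper, not a function from this PR, and the error wording is an assumption.

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or raise with a hint naming the extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this runtime; "
            f"install the optional [{extra}] extra."
        ) from exc
```

A provider constructor could call `require_extra("modal", "modal")` so that importing the package itself never requires the cloud SDK.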

Outcome / Verification

  • Drops in via Taskset.run(runtime=...); no engine/client/protocol changes.
  • Lint clean; new deps are optional extras.
  • Follow-ups: --runtime modal|daytona CLI flag, ws:// transport (drop the SSH
    hop), warm-pool to amortize cold start.

Note

Medium Risk
Touches core eval placement (container/cloud provisioning, credentials, SSH tunneling) and changes DockerRuntime CLI shape; failures are mostly isolated per rollout but misconfiguration could affect parallel batch runs.

Overview
Adds portable per-task launch requirements via RuntimeConfig (image, RuntimeResources, RuntimeLimits) on Task, wired through platform sync and documented in the types reference. Runtime now carries the effective config after provider defaults merge with row-level overrides via with_overrides.
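
To make the merge semantics concrete, here is a minimal sketch of provider defaults combined with row-level overrides. The class names mirror the ones mentioned above, but every field name (`cpu`, `memory_mb`, `gpu_count`, `timeout_s`) and the rule "a non-None override wins" are assumptions for illustration, not the definitions in this PR.

```python
from dataclasses import dataclass, field, fields, replace
from typing import Optional


@dataclass(frozen=True)
class RuntimeResources:
    cpu: Optional[float] = None       # assumed field names
    memory_mb: Optional[int] = None
    gpu_count: Optional[int] = None


@dataclass(frozen=True)
class RuntimeLimits:
    timeout_s: Optional[int] = None   # assumed field name


def _merge(base, override):
    """Return base with every non-None field of override applied."""
    changes = {
        f.name: getattr(override, f.name)
        for f in fields(override)
        if getattr(override, f.name) is not None
    }
    return replace(base, **changes)


@dataclass(frozen=True)
class RuntimeConfig:
    image: Optional[str] = None
    resources: RuntimeResources = field(default_factory=RuntimeResources)
    limits: RuntimeLimits = field(default_factory=RuntimeLimits)

    def with_overrides(self, override: Optional["RuntimeConfig"]) -> "RuntimeConfig":
        # No row-level config: the provider defaults are the effective config.
        if override is None:
            return self
        return RuntimeConfig(
            image=override.image if override.image is not None else self.image,
            resources=_merge(self.resources, override.resources),
            limits=_merge(self.limits, override.limits),
        )
```

With provider defaults of 2 CPUs and 4096 MB, a task row that sets only `memory_mb=8192` and `gpu_count=1` yields an effective config that keeps the default image and CPU count.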

DockerRuntime is refactored around runtime_config (CPU/memory/GPU count → docker run flags) and surfaces unsupported fields with clear errors. ModalRuntime and DaytonaRuntime are new Providers that spin up isolated cloud sandboxes per rollout (Modal TCP tunnels; Daytona env serve + SSH local-forward), both honoring runtime_config where the backend allows. LocalRuntime and HUDRuntime explicitly reject task-level runtime_config for now.
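
As an illustration of the CPU/memory/GPU count → docker run flags mapping, a sketch could look like the following. The flags `--cpus`, `--memory`, and `--gpus` are real `docker run` options; the function name, the parameter names, and the choice to treat `timeout_s` as the unsupported field are assumptions, not the behaviour of DockerRuntime.

```python
from typing import List, Optional


def docker_run_flags(
    cpu: Optional[float] = None,
    memory_mb: Optional[int] = None,
    gpu_count: Optional[int] = None,
    timeout_s: Optional[int] = None,
) -> List[str]:
    """Translate resource requests into docker run flags (illustrative)."""
    if timeout_s is not None:
        # Assumed unsupported here: fail loudly instead of silently ignoring it.
        raise ValueError("timeout_s is not supported by this Docker mapping sketch")
    flags: List[str] = []
    if cpu is not None:
        flags += ["--cpus", str(cpu)]
    if memory_mb is not None:
        flags += ["--memory", f"{memory_mb}m"]
    if gpu_count is not None:
        flags += ["--gpus", str(gpu_count)]
    return flags
```

Raising on a field the backend cannot honor keeps a misconfigured task from running with weaker limits than it asked for.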

Public exports include the runtime config types; optional [modal] and [daytona] extras gate cloud SDK deps. Provider contract tests cover Docker/Modal/Daytona mapping and validation.

Reviewed by Cursor Bugbot for commit ae79946.

lukass16 added 2 commits June 17, 2026 05:08
Add ModalRuntime as a Provider alongside DockerRuntime: resolve image once
(from_name or lazy build), create an isolated Sandbox per rollout, expose
the env control channel over raw TCP, terminate on exit. Export from
hud.eval and add optional [modal] extra.
…oxes

Add DaytonaRuntime as a Provider alongside ModalRuntime: resolve snapshot once (build from image if missing), create an isolated sandbox per rollout, start the env server in a background session, reach it via an asyncssh local-forward (Daytona exposes only HTTPS previews, connect dials tcp://), delete on exit. workdir defaults to /app to match the scaffolded Dockerfile.hud. Export from hud.eval and add optional [daytona] extra.
Comment thread hud/eval/runtime.py
lukass16 and others added 5 commits June 17, 2026 06:08
Environment(capabilities=[...]) called add_capability() before _hooks_done
was initialized, raising AttributeError; move the flag init above the loop.
Also apply ruff format to satisfy CI (runtime.py, claude sdk agent, cli init).

Co-authored-by: Cursor <cursoragent@cursor.com>
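
The init-order fix described in this commit can be pictured with a small stand-in class. This is a sketch, not the real Environment: `SketchEnvironment` is hypothetical, and what `add_capability()` does with the flag is an assumption; only the ordering of the flag initialization relative to the loop reflects the commit message.

```python
class SketchEnvironment:
    """Minimal stand-in; only the initialization order matters here."""

    def __init__(self, capabilities=()):
        self._capabilities = []
        # Initialize the flag before the loop: add_capability() reads it,
        # so setting it afterwards would raise AttributeError.
        self._hooks_done = False
        for cap in capabilities:
            self.add_capability(cap)

    def add_capability(self, cap):
        # Assumed behaviour: capabilities are frozen once hooks have run.
        if self._hooks_done:
            raise RuntimeError("cannot add capabilities after hooks have run")
        self._capabilities.append(cap)
```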
The env server binds all interfaces inside the sandbox; the tunnel is the
only ingress, so the all-interfaces bind is intentional.

Co-authored-by: Cursor <cursoragent@cursor.com>
…smatch

The default command hardcoded --port 8765 while the SSH forward used the
port arg, so a non-default port left the tunnel pointing at a dead port.
Build the default command from port; an explicit command still overrides.

Co-authored-by: Cursor <cursoragent@cursor.com>
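
The port fix above amounts to deriving both the server command and the forward target from one `port` value. A minimal sketch, assuming a placeholder binary name `env-server` (the real command is not shown in this PR page) and hypothetical helper names:

```python
import shlex
from typing import List, Optional, Tuple

DEFAULT_PORT = 8765


def server_command(port: int = DEFAULT_PORT, command: Optional[str] = None) -> List[str]:
    """Build the env server command; an explicit command still overrides."""
    if command is not None:
        return shlex.split(command)
    # "env-server" is a placeholder name, not the actual binary.
    return ["env-server", "--port", str(port)]


def forward_target(port: int = DEFAULT_PORT) -> Tuple[str, int]:
    """Destination of the SSH local-forward inside the sandbox."""
    return ("127.0.0.1", port)
```

Because both helpers take the same `port` argument, a non-default port can no longer leave the tunnel pointing at a port nothing listens on, unless the caller passes an explicit command that binds elsewhere.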
Comment thread hud/eval/runtime.py Outdated
lukass16 and others added 2 commits June 17, 2026 21:44
Add RuntimeConfig to allow tasks to specify runtime images, compute resources, and lifecycle limits. This feature enables more granular control over task execution environments, accommodating varying requirements within the same taskset. Update relevant classes and methods to support this new configuration, including integration into task payloads and validation tests.
jdchawla29 force-pushed the lukass/modal-daytona-runtimes branch from 984917b to 420718b on June 18, 2026 19:27
Comment thread hud/eval/runtime.py
Comment thread hud/eval/runtime.py
Comment thread hud/eval/runtime.py Outdated
Comment thread hud/eval/runtime.py

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit d3775af.

Comment thread hud/eval/runtime.py
jdchawla29 merged commit 566ecfe into v6 on Jun 18, 2026
10 checks passed

2 participants