Description
Common code execution setups (SWE-bench, Terminal-Bench, etc.) require executing code in "sandboxed" environments.
The sandbox ensures that code written by the agent cannot "break out" onto the host, and in practice it ends up being Docker most of the time.
The main challenge is that not all environments support running Docker! You can run Docker on VMs directly, but if you're running a workload on Kubernetes, SLURM, MAST, etc., you typically can't run Docker because Docker itself requires high privileges.
A workaround could be to replace docker calls with a light shim. Essentially, this means that an evaluation harness (or a verifier in Forge) calls something that looks like docker but isn't docker.
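To make the shim idea concrete, here is a minimal sketch of how a `docker run` invocation might be translated into enroot commands. The function name `docker_run_to_enroot`, the small subset of supported flags, the default container name, and the `.sqsh` naming scheme are all assumptions made for illustration; a real shim would need to cover far more of the docker CLI, and passing docker's `-v` value straight through to `--mount` is itself an assumption.

```python
from typing import List


def docker_run_to_enroot(argv: List[str]) -> List[List[str]]:
    """Translate the arguments of `docker run` into a list of enroot commands.

    Hypothetical helper: only `--rm`, `--name NAME` and `-v/--volume SRC:DST`
    are understood; any other flag is rejected rather than silently ignored.
    """
    name = "sandbox"  # assumed default container name
    mounts: List[str] = []
    remove_after = False

    i = 0
    while i < len(argv) and argv[i].startswith("-"):
        flag = argv[i]
        if flag == "--rm":
            remove_after = True
            i += 1
        elif flag in ("-v", "--volume", "--name"):
            if i + 1 >= len(argv):
                raise ValueError(f"flag {flag} expects a value")
            if flag == "--name":
                name = argv[i + 1]
            else:
                mounts.append(argv[i + 1])
            i += 2
        else:
            raise ValueError(f"unsupported docker run flag: {flag}")

    if i >= len(argv):
        raise ValueError("missing image name")
    image, command = argv[i], argv[i + 1:]

    # Assumed naming scheme for the local squashfs file.
    sqsh = image.replace("/", "+").replace(":", "+") + ".sqsh"

    commands = [
        # Pull the Docker image and convert it to a squashfs image.
        ["enroot", "import", "--output", sqsh, f"docker://{image}"],
        # Unpack the squashfs image into a named container root filesystem.
        ["enroot", "create", "--name", name, sqsh],
    ]
    start = ["enroot", "start"]
    for mount in mounts:
        start += ["--mount", mount]
    commands.append(start + [name, *command])
    if remove_after:
        commands.append(["enroot", "remove", "-f", name])
    return commands
```

The function only builds command lists and does not run anything, so a harness could log or inspect the translation before handing each command to `subprocess.run`.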
enroot is a tool that turns "traditional container/OS images into unprivileged sandboxes" which has some amount of interoperability with docker. Our CoreWeave friends mention that this is a common sandbox tool that's being used in place of Docker by other shops.
We should practically be able to get very simple code execution in Forge as well. As long as we can get something like this code snippet below running, we should be able to integrate quickly (inspired by verifiers):
from monarch.actor import Actor, endpoint, this_host
import subprocess
import tempfile
import os
from pathlib import Path


class SandboxedCoder(Actor):
    def __init__(self, image: str, container_name: str = "sandbox"):
        """
        :param image: Path to enroot squashfs image (.sqsh) or container name.
        :param container_name: Name of the enroot container instance.
        """
        self.image = image
        self.container_name = container_name
        self._initialized = False

    @endpoint
    def reset(self):
        """(Re)create a clean container instance from the base image."""
        # Remove any old container
        subprocess.run(
            ["enroot", "remove", "-f", self.container_name],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        # Create new container from image
        result = subprocess.run(
            ["enroot", "create", "--name", self.container_name, self.image],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        if result.returncode != 0:
            raise RuntimeError(f"Failed to reset container: {result.stderr}")
        self._initialized = True

    @endpoint
    def execute(self, code: str) -> str:
        """
        Execute Python code inside the container.
        :param code: Python source code string to execute.
        :return: Captured stdout.
        """
        if not self._initialized:
            raise RuntimeError("Container not initialized. Call reset() first.")
        # Write code to a temporary file that we can mount
        with tempfile.TemporaryDirectory() as tmpdir:
            code_path = Path(tmpdir) / "script.py"
            code_path.write_text(code)
            # Run the code inside the container, mounting tmpdir
            cmd = [
                "enroot", "start",
                "--mount", f"{tmpdir}:/work",
                self.container_name,
                "python3", "/work/script.py"
            ]
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                text=True,
            )
        if result.returncode != 0:
            raise RuntimeError(f"Execution failed:\n{result.stderr}")
        return result.stdout

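# Usage (must run inside an async context, e.g. a notebook cell or an async main)
p = this_host().spawn_procs(per_host={"procs": 1})
sandbox = await p.spawn("coder", SandboxedCoder, "/path/to/ubuntu.sqsh", container_name="py-sandbox")
await sandbox.reset.call_one()
# call_one must be awaited here too, as with reset above
result = await sandbox.execute.call_one("print('hello from inside enroot!')")
print(result)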
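
One thing the snippet above does not handle is agent-written code that never terminates. A possible way to bound execution time, sketched here as an assumption rather than a settled design, is to run the `enroot start` command through a small helper that enforces a timeout and returns a structured result instead of raising. The names `ExecResult` and `run_with_timeout` are hypothetical, and whether killing `enroot start` also tears down every process inside the container has not been verified here.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ExecResult:
    returncode: Optional[int]  # None when the command timed out
    stdout: str
    stderr: str
    timed_out: bool


def _to_text(data: Union[str, bytes, None]) -> str:
    """Partial output attached to TimeoutExpired may be None, bytes or str."""
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


def run_with_timeout(cmd: List[str], timeout_s: float = 10.0) -> ExecResult:
    """Run a command, capturing output and giving up after timeout_s seconds."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return ExecResult(None, _to_text(exc.stdout), _to_text(exc.stderr), True)
    return ExecResult(proc.returncode, proc.stdout, proc.stderr, False)


if __name__ == "__main__":
    # Demonstrated with the local interpreter; inside execute() the command
    # would instead be the `enroot start ...` list built there.
    print(run_with_timeout([sys.executable, "-c", "print('ok')"]))
```

Returning a result object rather than raising lets a verifier distinguish a failing program from a hanging one and score each case differently.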