
AgentEngineSandboxCodeExecutor: external-delete recovery catches wrong exception class (genai.ClientError vs api_core.NotFound) #5480

@Number531

Summary

AgentEngineSandboxCodeExecutor catches the wrong exception class when attempting to recover from an externally deleted sandbox, so the session crashes instead of silently creating a replacement sandbox. Confirmed on ADK v1.27.2; affected code path unchanged through v1.31.1.

Affected file: google/adk/code_executors/agent_engine_sandbox_code_executor.py:103,119 (recovery path around sandboxes.get()).

Relation to other issues: This is a separate, additional bug from the field-name mismatches tracked in #3690. The two can be fixed independently.

Current behavior

When a cached sandbox is externally deleted (TTL expiry, quota recycle, manual cleanup, maintenance), the wrapper tries to detect this via sandboxes.get() and fall back to creating a new sandbox:

# google/adk/code_executors/agent_engine_sandbox_code_executor.py
from google.api_core import exceptions
...
try:
    sandbox = self._get_api_client().agent_engines.sandboxes.get(name=sandbox_name)
    if sandbox is None or sandbox.state != "STATE_RUNNING":
        create_new_sandbox = True
except exceptions.NotFound:
    create_new_sandbox = True

The except clause catches google.api_core.exceptions.NotFound. But the sandboxes.get() call through the Vertex SDK raises google.genai.errors.ClientError (wrapping HTTP 404). The two class hierarchies are disjoint:

google.api_core.exceptions.NotFound → api_core.ClientError → ...
google.genai.errors.ClientError      → genai.APIError      → Exception
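The mismatch can be illustrated without either library installed. The sketch below uses stand-in classes (all names here are hypothetical and only mirror the shape of the two hierarchies above; they are not the real google.api_core or google.genai classes) to show that an except clause for one branch never matches an exception from the other:

```python
# Stand-in classes that only mirror the shape of the two hierarchies.
# The real classes live in google.api_core.exceptions and google.genai.errors.

class ApiCoreClientError(Exception):
    """Stand-in for google.api_core.exceptions.ClientError."""


class ApiCoreNotFound(ApiCoreClientError):
    """Stand-in for google.api_core.exceptions.NotFound."""


class GenaiAPIError(Exception):
    """Stand-in for google.genai.errors.APIError."""

    def __init__(self, code, message):
        super().__init__(f"{code} {message}")
        self.code = code


class GenaiClientError(GenaiAPIError):
    """Stand-in for google.genai.errors.ClientError."""


def recovery_triggered(exc):
    """Return True if an `except ApiCoreNotFound` clause would catch exc."""
    try:
        raise exc
    except ApiCoreNotFound:
        return True
    except Exception:
        return False


# Neither class is an ancestor of the other, so the clause cannot match.
print(issubclass(GenaiClientError, ApiCoreNotFound))
print(recovery_triggered(ApiCoreNotFound("gone")))
print(recovery_triggered(GenaiClientError(404, "NOT_FOUND")))
```

With these stand-ins, only the api_core-style exception reaches the recovery branch; the genai-style 404 falls through, which matches the behavior described in this report.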

Effect: the genai 404 propagates past the except clause, and the session becomes unrecoverable. The next execute_code() call crashes; the session must be manually reset.

Reproducer

# 1. Dispatch one execution; session state now contains sandbox_name.
# 2. Manually delete the sandbox (gcloud or Vertex UI) OR wait for TTL expiry.
# 3. Dispatch another execution in the same session.
# Observed: google.genai.errors.ClientError propagates; session is dead.
# Expected: sandbox silently recreated; execution proceeds normally.

Proposed fix

Add a parallel except clause that catches the genai exception class and checks for 404:

from google.api_core import exceptions as api_core_exc
from google.genai import errors as genai_errors
...
try:
    sandbox = self._get_api_client().agent_engines.sandboxes.get(name=sandbox_name)
    if sandbox is None or sandbox.state != "STATE_RUNNING":
        create_new_sandbox = True
except api_core_exc.NotFound:
    create_new_sandbox = True
except genai_errors.ClientError as exc:
    status = getattr(exc, "code", None) or getattr(exc, "status_code", None)
    if status == 404 or "NOT_FOUND" in str(exc):
        create_new_sandbox = True
    else:
        raise
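One way to keep the 404 check testable is to factor it into a small predicate. The following is a minimal sketch, not the ADK implementation: GenaiClientError is a stand-in for google.genai.errors.ClientError, and is_not_found, needs_new_sandbox, and the dict-shaped sandbox are hypothetical names introduced only for illustration.

```python
class GenaiClientError(Exception):
    """Hypothetical stand-in for google.genai.errors.ClientError."""

    def __init__(self, code, message):
        super().__init__(f"{code} {message}")
        self.code = code


def is_not_found(exc):
    """Mirror the proposed check: numeric 404, or NOT_FOUND in the text."""
    status = getattr(exc, "code", None) or getattr(exc, "status_code", None)
    return status == 404 or "NOT_FOUND" in str(exc)


def needs_new_sandbox(get_sandbox):
    """Return True when the cached sandbox should be replaced.

    get_sandbox is a zero-argument callable standing in for the
    sandboxes.get(name=...) call; it returns a dict here for simplicity.
    """
    try:
        sandbox = get_sandbox()
    except GenaiClientError as exc:
        if is_not_found(exc):
            return True
        raise  # Any other client error (403, 429, ...) still propagates.
    return sandbox is None or sandbox.get("state") != "STATE_RUNNING"


def deleted():
    raise GenaiClientError(404, "NOT_FOUND: sandbox does not exist")


def forbidden():
    raise GenaiClientError(403, "PERMISSION_DENIED")


print(needs_new_sandbox(deleted))
print(needs_new_sandbox(lambda: {"state": "STATE_RUNNING"}))
```

Calling needs_new_sandbox(forbidden) re-raises the 403, so only the deleted-sandbox case is converted into a recreate decision and unrelated client errors stay visible.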

Reference workaround

utils/sandbox_executor_patched.py#L110-L123: a PatchedAgentEngineSandboxCodeExecutor subclass, in continuous production use since April 2026 with no observed regressions.

Environment

  • ADK: v1.27.2 (reproduces on v1.28.x-v1.31.1 — affected paths unchanged)
  • google-genai: 1.x
  • google-cloud-aiplatform / vertexai: 1.x
  • Python: 3.13
  • Agent Engine resource: any projects/.../reasoningEngines/... parent

Impact

Any long-running session whose sandbox is deleted externally becomes unrecoverable. In production multi-agent workflows, sandbox deletions happen routinely via TTL expiry; this bug turns a transient, recoverable condition into a permanent session failure.

Metadata

Labels

agent engine: [Component] This issue is related to Vertex AI Agent Engine
