
Sub-agent with sequential LRO tools fails to resume #5349

@tony-lenski

Description


🔴 Required Information

Describe the Bug:

When a sub-agent has multiple LRO tools that are called sequentially, the first HITL pause/resume works correctly, but the second resumption fails: the runner early-exits with no LLM content. Two independent bugs combine to cause this.

Steps to Reproduce:

  1. Create a resumable app with an orchestrator LlmAgent that delegates to a sub-agent
  2. The sub-agent has two LongRunningFunctionTools (e.g., a picker and a confirmation step)
  3. User triggers the first LRO tool → agent pauses → user responds → agent resumes correctly
  4. Agent calls the second LRO tool → agent pauses → user responds → agent fails to resume

See Minimal Reproduction Code below for a copy-paste-runnable script.

Expected Behavior:

Both resumptions should work identically: the sub-agent receives the tool response, the LLM is called, and it generates text or further tool calls.

Observed Behavior:

The first resumption works. The second resumption produces no LLM-generated content; the agent silently stops.

Environment Details:

  • ADK Library Version: 1.30.0 (bug present in 1.27.0–1.30.0, worked in 1.26.0)
  • Desktop OS: macOS
  • Python Version: 3.13+

Model Information:

  • Are you using LiteLLM: No
  • Which model is being used: gemini-2.0-flash (not model-dependent; the bug is in the runner/agent framework)

🟡 Optional Information

Regression:

Yes: worked in ADK 1.26.0, broken since 1.27.0 (the _resolve_invocation_id rewrite).

In 1.26.0, if no invocation_id was passed, the runner always created a new invocation, so there were no stale flags to trip over. In 1.27.0, _resolve_invocation_id (runners.py:356-383) auto-infers invocation_id from the FunctionResponse by searching session.events for the matching FunctionCall. This forces the resumed-invocation path, which replays stale flags from the previous pause.

Additional Context:

Root cause

The first resumption works because the orchestrator's initial run exits via the should_pause check (line 494), which does not set end_of_agent. The problems appear on the second resumption, when the orchestrator enters the sub-agent resume block (lines 474-483) for the first time.

Bug 1: should_pause_invocation ignores existing responses

should_pause_invocation (invocation_context.py) returns True if any FunctionCall ID is in long_running_tool_ids, without checking whether a FunctionResponse already exists. On the second resumption, the already-answered first LRO still triggers the pause guard in base_llm_flow.py:838-851, and the sub-agent's LLM flow exits immediately.

Bug 2: Premature end_of_agent on the orchestrator

After the first resumption, the orchestrator enters the sub-agent resume block (lines 474-483) and unconditionally sets end_of_agent=True, even when the sub-agent only paused for the second LRO rather than truly completing. On the second resumption, populate_invocation_agent_states replays this stale flag and the runner early-exits at runners.py:597.
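Condensed into a short simulation (illustrative only; the states dict stands in for the persisted per-agent state that populate_invocation_agent_states replays):

```python
# Illustrative timeline of the stale-flag replay described above.
states = {}  # stand-in for the persisted per-agent state


def set_agent_state(name, end_of_agent):
    states[name] = {"end_of_agent": end_of_agent}


# First resumption: the sub-agent pauses for LRO #2, but the resume
# block still runs to its tail and marks the orchestrator as finished.
set_agent_state("orchestrator", end_of_agent=True)

# Second resumption: the replayed flag makes the runner early-exit
# before the LLM is ever called.
would_early_exit = states["orchestrator"]["end_of_agent"]
assert would_early_exit  # matches the observed "no LLM content" symptom
```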

Trigger conditions

All three are required:

  1. ResumabilityConfig(is_resumable=True)
  2. Root LlmAgent delegates to a sub-agent that has LRO tools
  3. The sub-agent calls LRO tools sequentially (two or more pause/resume cycles in the same invocation)

Suggested fix

Bug 1 is addressed by PR #5072 (has_unresolved_long_running_tool_calls), open since 2026-03-30. The fix below is only for Bug 2, which #5072 does not cover.

Bug 2: The orchestrator should only set end_of_agent=True when the sub-agent has truly completed, not when it paused for another LRO:

Diff for llm_agent.py (sub-agent resume block, ~line 474)
   if agent_state is not None and (
       agent_to_transfer := self._get_subagent_to_resume(ctx)
   ):
+    sub_agent_paused = False
     async with Aclosing(agent_to_transfer.run_async(ctx)) as agen:
       async for event in agen:
+        # Requires the Bug 1 fix: should_pause_invocation must be response-aware
+        if ctx.should_pause_invocation(event):
+          sub_agent_paused = True
         yield event

-    ctx.set_agent_state(self.name, end_of_agent=True)
-    yield self._create_agent_state_event(ctx)
+    if not sub_agent_paused:
+      ctx.set_agent_state(self.name, end_of_agent=True)
+      yield self._create_agent_state_event(ctx)
     return

Backward-compatible: non-LRO sub-agent completions still set the flag exactly as before. The check is skipped entirely when is_resumable is False or when the root agent calls LRO tools directly (no sub-agent delegation), so existing setups are unaffected.

Verified: with both #5072 (Bug 1) and this fix (Bug 2) applied to ADK 1.30.0, all sequential resumptions work correctly.

Minimal Reproduction Code:

import asyncio
from google.adk.agents import LlmAgent
from google.adk.apps import App, ResumabilityConfig
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.adk.tools import LongRunningFunctionTool
from google.genai import types


async def select_item(tool_context) -> dict:
    """Step 1: user picks an item (blocks for user input)."""
    return {"status": "pending"}


async def confirm_choice(tool_context) -> dict:
    """Step 2: user confirms the choice (blocks for user input)."""
    return {"status": "pending"}


sub_agent = LlmAgent(
    name="picker",
    model="gemini-2.0-flash",
    instruction=(
        "First call select_item to let the user pick. "
        "After they respond, call confirm_choice to confirm. "
        "After both are done, summarize what was picked and confirmed."
    ),
    tools=[
        LongRunningFunctionTool(func=select_item),
        LongRunningFunctionTool(func=confirm_choice),
    ],
)
root_agent = LlmAgent(
    name="orchestrator",
    model="gemini-2.0-flash",
    instruction="Delegate to the picker agent.",
    sub_agents=[sub_agent],
)
app = App(
    name="repro",
    root_agent=root_agent,
    resumability_config=ResumabilityConfig(is_resumable=True),
)
session_service = InMemorySessionService()
runner = Runner(app=app, session_service=session_service)


async def main():
    session = await session_service.create_session(app_name="repro", user_id="u")

    # Step 1: agent delegates → picker calls select_item → pauses for user input
    step1_events = []
    async for event in runner.run_async(
        user_id="u",
        session_id=session.id,
        new_message=types.Content(
            role="user", parts=[types.Part(text="Pick something")]
        ),
    ):
        step1_events.append(event)

    fc1_id = next(
        fc.id
        for e in step1_events
        for fc in e.get_function_calls()
        if fc.name == "select_item" and e.long_running_tool_ids
    )

    # Step 2: user responds → picker resumes → calls confirm_choice → pauses again
    step2_events = []
    async for event in runner.run_async(
        user_id="u",
        session_id=session.id,
        new_message=types.Content(
            role="user",
            parts=[
                types.Part(
                    function_response=types.FunctionResponse(
                        id=fc1_id,
                        name="select_item",
                        response={"result": "option_a"},
                    )
                )
            ],
        ),
    ):
        step2_events.append(event)

    fc2_id = next(
        fc.id
        for e in step2_events
        for fc in e.get_function_calls()
        if fc.name == "confirm_choice" and e.long_running_tool_ids
    )

    # Step 3: user confirms → BUG: no LLM content, agent silently stops
    step3_events = []
    async for event in runner.run_async(
        user_id="u",
        session_id=session.id,
        new_message=types.Content(
            role="user",
            parts=[
                types.Part(
                    function_response=types.FunctionResponse(
                        id=fc2_id,
                        name="confirm_choice",
                        response={"confirmed": True},
                    )
                )
            ],
        ),
    ):
        step3_events.append(event)

    has_llm_content = any(
        e.get_function_calls()
        or (
            e.content
            and e.content.parts
            and any(p.text for p in e.content.parts)
        )
        for e in step3_events
        if e.author != "user"
    )
    assert has_llm_content, (
        f"BUG: Step 3 produced {len(step3_events)} events but no LLM content"
    )


asyncio.run(main())

How often has this issue occurred?:

  • Always (100%): deterministic with sequential LRO tools in a sub-agent

Related Issues:

Labels: core (This issue is related to the core interface and implementation)