Built-in Resilience Support for AgentTool #4087

@sarojrout

Is your feature request related to a problem? Please describe.
Yes. Currently, building resilient multi-agent systems with AgentTool requires significant custom code:

  1. No built-in timeout mechanism - Developers must create custom wrappers to add timeout protection
  2. No automatic fallback - Requires LLM reasoning and prompt engineering to route to alternative agents
  3. No result validation - No way to verify that sub-agent results are complete
  4. Complexity leakage - All sub-agent events are exposed, making it hard to hide internal complexity from users

Example Problem:
When a sub-agent times out or fails, the parent agent must manually handle the error, decide whether to retry, choose an alternative agent, and format user-friendly error messages. This requires:

  • Custom TimeoutAgentTool wrapper
  • Complex prompt engineering for routing
  • Manual error handling logic
  • Additional agents for error recovery

Impact:

  • High barrier to entry for building resilient multi-agent systems
  • Inconsistent error handling across different implementations
  • Difficult to test timeout and failure scenarios
  • Poor user experience when errors occur

Describe the solution you'd like
Add built-in resilience features to AgentTool:

1. Built-in Timeout Support

AgentTool(
    agent=sub_agent,
    timeout=30.0,  # Timeout in seconds
    timeout_handler='error',  # How to handle a timeout: 'error', 'fallback', or 'retry'
)
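
To make the intended timeout semantics concrete, here is a minimal sketch built on `asyncio.wait_for`. It does not use ADK at all: `run_with_timeout`, its `fallback` and `max_retries` parameters, and the way the three `timeout_handler` values are interpreted are assumptions for illustration, not an existing or agreed API.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

# Hypothetical helper: `call` stands in for "run the sub-agent once".
AgentCall = Callable[[], Awaitable[Any]]


async def run_with_timeout(
    call: AgentCall,
    timeout: float,
    timeout_handler: str = "error",
    fallback: Optional[AgentCall] = None,
    max_retries: int = 1,
) -> Any:
    """Run `call` under a deadline and apply the chosen timeout policy."""
    # Only the 'retry' policy gets extra attempts (assumed semantics).
    attempts = 1 + (max_retries if timeout_handler == "retry" else 0)
    for _ in range(attempts):
        try:
            # wait_for cancels the underlying task when the deadline passes.
            return await asyncio.wait_for(call(), timeout=timeout)
        except asyncio.TimeoutError:
            continue
    if timeout_handler == "fallback" and fallback is not None:
        return await fallback()
    # 'error', exhausted 'retry', or 'fallback' without a fallback callable.
    raise TimeoutError(f"sub-agent did not finish within {timeout}s")
```

With `timeout_handler='fallback'`, a slow primary call is abandoned and the fallback result is returned; with `'error'`, the caller sees a `TimeoutError` it can turn into a user-facing message.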

2. Automatic Fallback Configuration

AgentTool(
    agent=primary_agent,
    fallback_agent=fallback_agent,
    fallback_on_timeout=True,
    fallback_on_error=True,
    fallback_on_partial_result=False,
)
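
The three `fallback_on_*` flags could be read as a small policy object consulted after the primary agent runs. The sketch below is synchronous for brevity and independent of ADK; `FallbackPolicy`, `run_with_fallback`, and the `is_partial` predicate are hypothetical names introduced only to illustrate that reading.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FallbackPolicy:
    """Mirrors the proposed flags; defaults match the example above."""
    fallback_on_timeout: bool = True
    fallback_on_error: bool = True
    fallback_on_partial_result: bool = False


def run_with_fallback(
    primary: Callable[[], Any],
    fallback: Callable[[], Any],
    policy: FallbackPolicy,
    is_partial: Callable[[Any], bool] = lambda result: False,
) -> Any:
    """Call `primary`; switch to `fallback` only where the policy allows it."""
    try:
        result = primary()
    except TimeoutError:
        # Timeouts are checked first because TimeoutError is also an Exception.
        if not policy.fallback_on_timeout:
            raise
        return fallback()
    except Exception:
        if not policy.fallback_on_error:
            raise
        return fallback()
    if policy.fallback_on_partial_result and is_partial(result):
        return fallback()
    return result
```

Keeping the flags separate lets a caller fall back on timeouts while still surfacing genuine errors, which is hard to express reliably through prompt-based routing.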

3. Result Validation

AgentTool(
    agent=sub_agent,
    validate_result=True,
    required_fields=['summary', 'sources'],  # For structured output
    result_validator=lambda r: len(r.get('summary', '')) > 100,
)
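
One possible interpretation of `required_fields` plus `result_validator`, sketched as a standalone function. `validate_result` and its return shape (a list of problem strings, empty when the result passes) are assumptions, as is treating empty values as missing.

```python
from typing import Any, Callable, List, Mapping, Optional, Sequence


def validate_result(
    result: Mapping[str, Any],
    required_fields: Sequence[str] = (),
    result_validator: Optional[Callable[[Mapping[str, Any]], bool]] = None,
) -> List[str]:
    """Return a list of validation problems; an empty list means valid."""
    problems: List[str] = []
    for field in required_fields:
        # Assumption: None and empty containers/strings count as missing.
        if field not in result or result[field] in (None, "", [], {}):
            problems.append(f"missing or empty field: {field}")
    if result_validator is not None and not result_validator(result):
        problems.append("custom validator rejected the result")
    return problems
```

Returning the problems rather than a bare boolean gives the parent agent something specific to report or to feed into a retry.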

4. Event Filtering

AgentTool(
    agent=sub_agent,
    stream_events=True,  # True: stream all events; False: stream only the final result
    hide_intermediate_steps=True,  # Hide tool calls, show only results
)
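
Event filtering could be implemented as a thin async generator between the sub-agent's event stream and the parent. The `Event` class below is a stand-in, not ADK's event type, and the assumption that `hide_intermediate_steps` drops only events of kind `"tool_call"` is one reading of the comment above.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Event:
    """Hypothetical event shape used only for this sketch."""
    kind: str            # e.g. "tool_call" or "text"
    text: str
    is_final: bool = False


async def filter_events(
    events: AsyncIterator[Event],
    stream_events: bool = True,
    hide_intermediate_steps: bool = False,
) -> AsyncIterator[Event]:
    """Yield only the events the parent agent should expose."""
    async for event in events:
        if not stream_events and not event.is_final:
            continue  # only the final result is streamed
        if hide_intermediate_steps and event.kind == "tool_call":
            continue  # tool calls are hidden, other output still flows
        yield event
```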

5. Partial Result Handling

AgentTool(
    agent=sub_agent,
    handle_partial_results='error',  # How to handle a partial result: 'error', 'retry', or 'return'
    partial_result_threshold=0.8,  # 80% complete = valid
)
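
The proposal does not define how completeness is measured, so the sketch below assumes the simplest option: the fraction of expected fields that are non-empty. `completeness`, `handle_partial`, `PartialResultError`, and the `retry` callable are all hypothetical.

```python
from typing import Any, Callable, Mapping, Optional, Sequence


class PartialResultError(Exception):
    """Raised when a result is below the threshold and the mode is 'error'."""


def completeness(result: Mapping[str, Any], expected_fields: Sequence[str]) -> float:
    """Assumed metric: share of expected fields with a truthy value."""
    if not expected_fields:
        return 1.0
    filled = sum(1 for field in expected_fields if result.get(field))
    return filled / len(expected_fields)


def handle_partial(
    result: Mapping[str, Any],
    expected_fields: Sequence[str],
    threshold: float = 0.8,
    mode: str = "error",
    retry: Optional[Callable[[], Mapping[str, Any]]] = None,
) -> Mapping[str, Any]:
    """Accept results at or above the threshold; otherwise apply `mode`."""
    score = completeness(result, expected_fields)
    if score >= threshold or mode == "return":
        return result
    if mode == "retry" and retry is not None:
        return retry()  # single retry, returned without re-checking
    raise PartialResultError(f"result only {score:.0%} complete")
```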

Describe alternatives you've considered

Alternative 1: Custom Wrappers (Current Approach)

Pros:

  • Works today without ADK changes
  • Flexible and customizable
  • Non-breaking

Cons:

  • Requires significant custom code
  • Inconsistent across implementations
  • Hard to maintain
  • High barrier to entry

Alternative 2: Plugin-Based Solution

Pros:

  • Extensible
  • Doesn't require ADK core changes

Cons:

  • Still requires custom code
  • Less discoverable
  • More complex API

Alternative 3: Built-in Support (Proposed)

Pros:

  • Simple, consistent API
  • Low barrier to entry
  • Better developer experience
  • Easier to test

Cons:

  • Requires ADK core changes
  • Need to maintain backward compatibility

Recommendation: Built-in support is the best long-term solution, as it makes resilience patterns a first-class feature.

Additional context

Sample Implementation

I've created a working sample (#4086) that demonstrates:

  • Custom TimeoutAgentTool wrapper
  • Integration with ReflectAndRetryToolPlugin
  • Prompt-based dynamic routing
  • Error recovery patterns

Labels: core [Component] (this issue is related to the core interface and implementation)