Description
Currently, when an execution run throws an unhandled exception or the system crashes mid-invocation (e.g. timeout, SIGTERM), the BigQueryAgentAnalyticsPlugin leaves only dangling INVOCATION_STARTING and AGENT_STARTING events in the database. No AGENT_ERROR or INVOCATION_ERROR event is emitted or supported.
Because the .COMPLETED event carries the status and latency duration, losing it means crashed calls appear either implicitly successful or as dangling threads in basic analytics. And because crashed calls never emit latency metadata, they are excluded from average latency calculations, so executive dashboards show the agent system as faster and more reliable than it actually is.
This feature matters for building comprehensive observability pipelines. Without native execution error tracking, we are forced to reconstruct crash statuses artificially via time-boundary SQL logic. This gap allows severe failures to masquerade as non-events (false positives) and breaks accurate system latency reporting.
Describe the Solution You'd Like
- Introduce an on_agent_error_callback(agent, error) and on_run_error_callback(invocation_context, error) at the framework lifecycle level to catch and broadcast agent-level and end-to-end invocation crashes, similarly to how on_tool_error_callback and on_model_error_callback currently operate.
- Update the BigQueryAgentAnalyticsPlugin (at https://github.com/google/adk-python/tree/main/src/google/adk/plugins/bigquery_agent_analytics_plugin.py) to properly ingest and log AGENT_ERROR and INVOCATION_ERROR events mapped from these new callbacks.
Describe Alternatives You've Considered
Currently, developers must implement custom SQL logic on top of the BigQuery tables to manually flag dangling events. For example, joining STARTING events against COMPLETED events and checking if the time difference exceeds a hardcoded threshold (e.g., > 10 minutes) before classifying it as a generic timeout error.
This workaround is imprecise: it loses the original Python exception stack trace entirely, depends on arbitrary time thresholds, and does not address the underlying problem that the baseline ADK framework never surfaces the fatal error to plugins.
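To make the limits of that workaround concrete, here is a minimal pure-Python sketch of the same time-threshold logic (in practice it would be a SQL join over the BigQuery table). The row shape, the INVOCATION_COMPLETED event name, and the flag_dangling helper are assumptions made for illustration; they are not taken from the plugin.

```python
from datetime import datetime, timedelta

# Hypothetical row shape: (invocation_id, event_type, timestamp).
# The 10-minute cutoff is arbitrary, which is the core weakness of this approach.
THRESHOLD = timedelta(minutes=10)


def flag_dangling(rows, now, threshold=THRESHOLD):
    """Return IDs of invocations that started but have no completion event
    and are older than the threshold (hypothetical helper)."""
    started = {}
    completed = set()
    for invocation_id, event_type, ts in rows:
        if event_type == "INVOCATION_STARTING":
            started[invocation_id] = ts
        elif event_type == "INVOCATION_COMPLETED":  # assumed event name
            completed.add(invocation_id)
    return sorted(
        inv
        for inv, ts in started.items()
        if inv not in completed and now - ts > threshold
    )


now = datetime(2025, 1, 1, 12, 0)
rows = [
    ("a", "INVOCATION_STARTING", datetime(2025, 1, 1, 11, 0)),
    ("a", "INVOCATION_COMPLETED", datetime(2025, 1, 1, 11, 1)),
    ("b", "INVOCATION_STARTING", datetime(2025, 1, 1, 11, 30)),  # crashed long ago
    ("c", "INVOCATION_STARTING", datetime(2025, 1, 1, 11, 55)),  # crashed or still running?
]
suspected_timeouts = flag_dangling(rows, now)
```

Invocation "b" is flagged only as a generic suspected timeout with no exception attached, and "c" cannot be classified at all until the threshold elapses.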
Proposed API / Implementation
Add on_agent_error_callback and on_run_error_callback interfaces to BasePlugin. Invoke on_run_error_callback inside the base exception handlers of the Runner.run_async()/InvocationContext flow, and invoke on_agent_error_callback for individual sub-agent LlmAgent.run_async() failures.
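A rough, self-contained sketch of how the run-level hook could be wired is shown here. SketchPlugin, RecordingPlugin, and run_with_error_hook are hypothetical stand-ins, not ADK classes, and the keyword-only callback signatures are an assumption modeled on the names proposed above; the real Runner internals may look quite different.

```python
import asyncio
from typing import Any


class SketchPlugin:
    """Hypothetical stand-in for BasePlugin with the two proposed hooks."""

    async def on_agent_error_callback(self, *, agent: Any, error: Exception) -> None:
        pass  # default: no-op

    async def on_run_error_callback(
        self, *, invocation_context: Any, error: Exception
    ) -> None:
        pass  # default: no-op


class RecordingPlugin(SketchPlugin):
    """Records run-level errors so the demo can inspect them."""

    def __init__(self) -> None:
        self.seen = []

    async def on_run_error_callback(self, *, invocation_context, error) -> None:
        self.seen.append((invocation_context, type(error).__name__, str(error)))


async def run_with_error_hook(invocation_context, agent_run, plugins):
    """Yield events from agent_run(); on failure, notify plugins and re-raise."""
    try:
        async for event in agent_run():
            yield event
    except Exception as error:
        for plugin in plugins:
            await plugin.on_run_error_callback(
                invocation_context=invocation_context, error=error
            )
        raise  # do not swallow the original exception


async def failing_agent():
    yield "first event"
    raise RuntimeError("model backend unavailable")


async def demo():
    plugin = RecordingPlugin()
    events = []
    try:
        async for event in run_with_error_hook("inv-1", failing_agent, [plugin]):
            events.append(event)
    except RuntimeError:
        pass  # the caller still sees the failure
    return events, plugin.seen


events, seen = asyncio.run(demo())
```

Note that an `except Exception` handler like this one covers unhandled exceptions only. It does not catch `asyncio.CancelledError` (a `BaseException` subclass since Python 3.8), and a process killed by a signal would need separate handling.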
Then inside BigQueryAgentAnalyticsPlugin, add "AGENT_ERROR" and "INVOCATION_ERROR" to _EVENT_TYPES and map the incoming error trace directly to BigQuery's error_message column.
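The mapping step might look roughly like the sketch below. It does not touch the real plugin or the BigQuery client: build_error_row is a hypothetical helper, the _EVENT_TYPES set is an illustrative subset rather than the plugin's actual list, and every column name other than error_message is assumed.

```python
import traceback
from datetime import datetime, timezone

# Illustrative subset only; the plugin's actual _EVENT_TYPES is not reproduced here.
_EVENT_TYPES = {
    "INVOCATION_STARTING",
    "AGENT_STARTING",
    "AGENT_ERROR",       # proposed
    "INVOCATION_ERROR",  # proposed
}


def build_error_row(event_type: str, invocation_id: str, error: BaseException) -> dict:
    """Build a row dict for an error event (hypothetical helper and schema)."""
    if event_type not in _EVENT_TYPES:
        raise ValueError(f"unsupported event type: {event_type}")
    return {
        "event_type": event_type,
        "invocation_id": invocation_id,  # assumed column name
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed column name
        # Preserve the full Python stack trace instead of a generic timeout label.
        "error_message": "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        ),
    }


try:
    raise TimeoutError("tool call exceeded deadline")
except TimeoutError as exc:
    row = build_error_row("INVOCATION_ERROR", "inv-1", exc)
```

Storing the formatted traceback in error_message keeps the original exception context that the time-threshold workaround loses.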