Python: [Feature]: Non-streaming mode token usage event/hook #4671

@0x7c13

Description

In streaming mode, I can read the ["usage"] entries from update.contents to get the actual token usage of each tool call, in real time, for the inner chat/agent run.
In non-streaming mode, however, there is no clean way to get a per-tool-call usage event. The only way I have found so far is a monkey-patch hook like the one below.
Is there any plan to support this through a public API?

Code Sample

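from typing import Any

# UsageTrackingMiddleware is not shown in this issue; it is assumed to be
# defined elsewhere and to expose _handle_usage and _per_call_hook_active.


def attach_per_call_usage_hook(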
    client: Any,
    tracker: UsageTrackingMiddleware,
) -> None:
    """Monkey-patch a chat client to fire usage callbacks after each model call.

    This wraps ``client._inner_get_response()`` so that for **non-streaming**
    calls, after the underlying API returns a ``ChatResponse``, the tracker's
    ``_handle_usage`` is called immediately with ``dict(response.usage_details)``.

    Also sets ``tracker._per_call_hook_active = True`` so the middleware's
    ``process()`` avoids firing duplicate callbacks for the final response.

    In the agent-framework MRO::

        ChatMiddlewareLayer [once]
          → FunctionInvocationLayer [tool loop]
            → ChatTelemetryLayer [per-call]   ← same level
              → BaseChatClient.get_response()
                → _inner_get_response()       ← we wrap here

    This gives us the same per-iteration accuracy as OpenTelemetry spans
    without requiring otel to be enabled.

    Streaming calls are left unmodified — use ``ResponseStream`` hooks instead.
    """
    tracker._per_call_hook_active = True
    callback = tracker._handle_usage
    original = client._inner_get_response

    def _wrapped(*, messages: Any, stream: bool, options: Any, **kwargs: Any) -> Any:
        result = original(messages=messages, stream=stream, options=options, **kwargs)
        if stream:
            return result  # Streaming handled by ResponseStream hooks

        # Wrap the awaitable to intercept the response
        async def _intercept() -> Any:
            response = await result
            if response and hasattr(response, "usage_details") and response.usage_details:
                callback(dict(response.usage_details))
            return response

        return _intercept()

    client._inner_get_response = _wrapped
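
To make the behaviour of the wrapper above easier to check in isolation, here is a minimal, self-contained sketch of the same wrapping technique. It does not import agent-framework: `FakeClient`, `FakeResponse`, `wrap_with_usage_callback`, `simulate_tool_loop` and `run_demo` are all hypothetical stand-ins, and the `input_token_count` / `output_token_count` keys are illustrative only. It assumes, as the hook above does, that `_inner_get_response` returns an awaitable in non-streaming mode and that the response carries a `usage_details` mapping.

```python
import asyncio
from typing import Any, Callable


class FakeResponse:
    """Hypothetical stand-in for a chat response; only usage_details matters."""

    def __init__(self, usage_details: dict[str, int]) -> None:
        self.usage_details = usage_details


class FakeClient:
    """Hypothetical stand-in for a chat client exposing _inner_get_response."""

    def __init__(self) -> None:
        self.calls = 0

    def _inner_get_response(
        self, *, messages: Any, stream: bool, options: Any, **kwargs: Any
    ) -> Any:
        async def _call() -> FakeResponse:
            self.calls += 1
            # Illustrative usage numbers that change with every model call.
            return FakeResponse(
                {"input_token_count": 10 * self.calls, "output_token_count": 5}
            )

        return _call()


def wrap_with_usage_callback(
    client: Any, callback: Callable[[dict[str, Any]], None]
) -> None:
    """Wrap client._inner_get_response so callback fires after each call."""
    original = client._inner_get_response

    def _wrapped(*, messages: Any, stream: bool, options: Any, **kwargs: Any) -> Any:
        result = original(messages=messages, stream=stream, options=options, **kwargs)
        if stream:
            return result  # Streaming results are passed through untouched.

        async def _intercept() -> Any:
            response = await result
            usage = getattr(response, "usage_details", None)
            if usage:
                callback(dict(usage))
            return response

        return _intercept()

    client._inner_get_response = _wrapped


async def simulate_tool_loop(client: Any, iterations: int) -> None:
    """Imitate a tool loop that makes one model call per iteration."""
    for _ in range(iterations):
        await client._inner_get_response(messages=[], stream=False, options=None)


def run_demo() -> list[dict[str, int]]:
    seen: list[dict[str, int]] = []
    client = FakeClient()
    wrap_with_usage_callback(client, seen.append)
    asyncio.run(simulate_tool_loop(client, 3))
    return seen


if __name__ == "__main__":
    for usage in run_demo():
        print(usage)
```

With three simulated iterations the callback receives three separate usage dicts, one per model call. That per-iteration granularity is what this request asks to have exposed without patching a private method.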

Language/SDK

Python

Metadata

Labels

Projects

Status

In Review

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests