Description
🚀 Describe the new functionality needed
While implementing the Bedrock provider (#3748), I found that streaming requests don't collect token usage metrics by default. I fixed it for Bedrock by adding `stream_options = {"include_usage": True}` when telemetry is active. As @mattf pointed out, this logic belongs in the `OpenAIMixin` base class so that all OpenAI-compatible providers get streaming metrics automatically, not just Bedrock. Right now the code lives in `BedrockInferenceAdapter.openai_chat_completion()`:
```python
# Enable streaming usage metrics when telemetry is active
if params.stream and get_current_span() is not None:
    if params.stream_options is None:
        params.stream_options = {"include_usage": True}
    elif "include_usage" not in params.stream_options:
        params.stream_options = {**params.stream_options, "include_usage": True}
```
This should move into `OpenAIMixin.openai_chat_completion()` in `src/llama_stack/providers/utils/inference/openai_mixin.py` so other providers get it for free.
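For concreteness, here is a rough sketch of what the hoisted check might look like in the mixin. The method signature, the `_do_openai_chat_completion()` downstream call, and the `get_current_span` import path are assumptions for illustration, not the actual mixin code:

```python
# Sketch only: signature, downstream call, and import path are assumptions.
from llama_stack.providers.utils.telemetry.tracing import get_current_span  # assumed path; reuse whatever Bedrock already imports


class OpenAIMixin:
    async def openai_chat_completion(self, params):
        # Enable streaming usage metrics whenever telemetry is active, so every
        # OpenAI-compatible provider built on this mixin reports token usage for
        # streaming requests without re-implementing the check per adapter.
        if params.stream and get_current_span() is not None:
            if params.stream_options is None:
                params.stream_options = {"include_usage": True}
            elif "include_usage" not in params.stream_options:
                params.stream_options = {**params.stream_options, "include_usage": True}
        # Hypothetical downstream call standing in for the mixin's existing request path.
        return await self._do_openai_chat_completion(params)
```

With this in place, `BedrockInferenceAdapter.openai_chat_completion()` could drop its copy of the check entirely.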
💡 Why is this needed? What if we don't build it?
Without this, streaming requests have a blind spot in telemetry: we can track tokens for non-streaming requests but not for streaming ones.
This makes it hard to:
- Monitor production costs accurately (streaming is common in chat apps)
- Debug performance issues (can't see if a streaming request is token-heavy)
- Set up proper rate limiting based on actual usage

Separately, the `include_usage` parameter isn't obvious from the OpenAI docs and is easy to miss.
If we don't standardize this, every new provider implementer has to rediscover it themselves, and it creates inconsistency: some providers would have streaming metrics, others wouldn't. Since we already check `get_current_span() is not None` to detect whether telemetry is enabled, there's no performance cost when telemetry is off.
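For reference, with `include_usage` set, OpenAI-compatible streams emit one final chunk whose `choices` list is empty and whose `usage` field carries the token counts. A minimal consumer sketch using the `openai` client, with the model id and endpoint as placeholders:

```python
from openai import OpenAI

client = OpenAI()  # endpoint/credentials are placeholders for illustration

stream = client.chat.completions.create(
    model="my-model",  # placeholder model id
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    elif chunk.usage is not None:
        # Final chunk: no choices, just the aggregate token counts that
        # telemetry would record for the streaming request.
        print(f"\nprompt={chunk.usage.prompt_tokens} "
              f"completion={chunk.usage.completion_tokens} "
              f"total={chunk.usage.total_tokens}")
```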
Other thoughts
Thanks @mattf for pointing this out.