Update STT metrics to include token usage and emit gpt-realtime transcription STT token counts #5029
bml1g12 wants to merge 11 commits into
Conversation
…nscription events
…ml1g12/agents into add_gpt_realtime_transcription_metrics
thanks for the PR. do you mind rebasing it on top of
Ah I see, I will do this but it might be a little while from now as I have a backlog of other tasks
That branch seems to no longer exist nor does it match "main"? Was it merged?
v1.4.0 was renamed to v1.5.0 :) it's
OK I'll wait and pull in main soon
it's ready for rebase now
# Conflicts:
#   livekit-agents/livekit/agents/metrics/base.py
#   livekit-agents/livekit/agents/metrics/utils.py
@davidzhao OK, I've rebased - ready for re-review
@davidzhao OK to merge? |
Summary
The OpenAI Realtime API's `conversation.item.input_audio_transcription.completed` event carries a `usage` field with ASR duration (whisper-1 / gpt-4o-transcribe), billed separately from the realtime model. LiveKit currently ignores this field, so users cannot track transcription metrics via `on_metrics_collected`.

Per OpenAI's Realtime costs documentation, input transcription is billed at the ASR model's rate (e.g. $0.006 / 1M tokens for whisper-1), separately from the realtime model's audio tokens. I have confirmed with OpenAI support that when using gpt-realtime, the Whisper ASR model is billed per token, not per minute, so for cost tracking we need at least the audio token counts. Sadly, at the time of writing OpenAI only emits the duration (`UsageTranscriptTextUsageDuration`), despite their blog suggesting it should emit the token counts (`UsageTranscriptTextUsageTokens`). I have made OpenAI support aware of the contradiction between their blog and the events actually emitted. In any case, this PR also handles `UsageTranscriptTextUsageTokens` events should they be produced in future. If OpenAI later emits the audio token counts (the ones we care most about for cost estimation), the same code path will pick them up, letting LiveKit users track their Whisper costs on a per-session basis. As it stands today, this PR only enables LiveKit users to track the duration of Whisper ASR performed.

The `Metadata.model_name` field identifies which transcription model produced the metrics (e.g. `whisper-1`, `gpt-4o-transcribe`).

Note that I have not emitted these metrics as OTEL traces, as it seems we currently do not emit STT traces in general, and because for Langfuse to track the cost of these I think I would need to use platform-specific attributes (`"langfuse.observation.type": "generation"`), as the OTEL specification does not have a standard attribute for STT token counting. I would be happy to add this as a further improvement if there is interest from the LiveKit team, but otherwise will just implement it in our own client code.
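To illustrate the two usage shapes discussed above, here is a minimal sketch of normalizing a duration-based or token-based usage payload into one record. `DurationUsage`, `TokenUsage`, `SttUsage`, and `extract_stt_usage` are hypothetical stand-ins written for this example; they are not the OpenAI SDK classes or the code in this PR, and their field names are assumptions modeled on the names mentioned in the description.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class DurationUsage:
    """Hypothetical stand-in for the duration-based usage variant."""
    seconds: float


@dataclass
class TokenUsage:
    """Hypothetical stand-in for the token-based usage variant."""
    input_tokens: int
    output_tokens: int
    total_tokens: int
    audio_tokens: Optional[int] = None  # assumed field: audio share of input tokens


@dataclass
class SttUsage:
    """Normalized record; token fields stay None when the event has no counts."""
    audio_duration: float = 0.0
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    total_tokens: Optional[int] = None
    input_audio_tokens: Optional[int] = None


def extract_stt_usage(
    usage: Union[DurationUsage, TokenUsage, None],
) -> Optional[SttUsage]:
    """Map either usage variant onto SttUsage; return None if there is no usage."""
    if isinstance(usage, DurationUsage):
        # Duration-only events: token fields are left unset.
        return SttUsage(audio_duration=usage.seconds)
    if isinstance(usage, TokenUsage):
        # Token events: copy the counts and leave the duration at its default.
        return SttUsage(
            input_tokens=usage.input_tokens,
            output_tokens=usage.output_tokens,
            total_tokens=usage.total_tokens,
            input_audio_tokens=usage.audio_tokens,
        )
    return None  # missing or unrecognized usage payload
```

Keeping the token fields as `None` rather than `0` lets a consumer distinguish "no token counts were reported" from "zero tokens were used", which matters while only the duration variant is being emitted.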
Changes
- `STTMetrics` (`metrics/base.py`): Add optional `input_tokens`, `output_tokens`, `total_tokens`, and `input_audio_tokens` fields. All default to `None` so existing STT plugins are unaffected.
- `realtime_model.py`: Extract usage from `conversation.item.input_audio_transcription.completed` events and emit `STTMetrics` via the existing `metrics_collected` event. Handles both the token-based (`UsageTranscriptTextUsageTokens`) and duration-based (`UsageTranscriptTextUsageDuration`) usage variants from the OpenAI SDK.
- `log_metrics` (`metrics/utils.py`): Log token fields for STT metrics when present.
- `UsageCollector` (`metrics/usage_collector.py`): Aggregate `stt_input_tokens`, `stt_output_tokens`, and `stt_input_audio_tokens` in `UsageSummary`.

Design decisions

- `STTMetrics` rather than `RealtimeModelMetrics`: The transcription runs on a separate model (whisper/gpt-4o-transcribe) with its own billing rate, so it belongs in `STTMetrics` with the model identified via `Metadata`.
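The optional-field and aggregation behaviour described under Changes could look roughly like the sketch below. `SttMetricsSketch`, `UsageSummarySketch`, and `UsageCollectorSketch` are hypothetical, simplified stand-ins, not the LiveKit classes themselves; only the token field names are taken from the description above, and `stt_audio_duration` is an assumed summary field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SttMetricsSketch:
    """Simplified stand-in for STT metrics with optional token fields."""
    audio_duration: float = 0.0
    model_name: Optional[str] = None
    # New fields default to None, so producers that never set them are unaffected.
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    total_tokens: Optional[int] = None
    input_audio_tokens: Optional[int] = None


@dataclass
class UsageSummarySketch:
    """Simplified stand-in for the per-session usage summary."""
    stt_audio_duration: float = 0.0
    stt_input_tokens: int = 0
    stt_output_tokens: int = 0
    stt_input_audio_tokens: int = 0


class UsageCollectorSketch:
    """Accumulates STT metrics, treating missing token counts as zero."""

    def __init__(self) -> None:
        self._summary = UsageSummarySketch()

    def collect(self, metrics: SttMetricsSketch) -> None:
        s = self._summary
        s.stt_audio_duration += metrics.audio_duration
        # `or 0` turns None (no counts reported) into a no-op for the running sum.
        s.stt_input_tokens += metrics.input_tokens or 0
        s.stt_output_tokens += metrics.output_tokens or 0
        s.stt_input_audio_tokens += metrics.input_audio_tokens or 0

    def get_summary(self) -> UsageSummarySketch:
        return self._summary


collector = UsageCollectorSketch()
# Duration-only metric, as emitted today according to the description.
collector.collect(SttMetricsSketch(audio_duration=3.0, model_name="whisper-1"))
# Token-bearing metric, as it might look if token counts are emitted in future.
collector.collect(
    SttMetricsSketch(
        model_name="gpt-4o-transcribe",
        input_tokens=20,
        output_tokens=5,
        total_tokens=25,
        input_audio_tokens=18,
    )
)
summary = collector.get_summary()
```

With this shape, a duration-only metric contributes only to the duration total, while a token-bearing metric adds to the token totals without any change to the collector.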