
[voice] Add dedicated STT latency and TTS generation speed benchmarks #3442

Merged
makr-code merged 2 commits into develop from copilot/improve-stt-latency-tts-speed
Mar 1, 2026


Copilot AI commented Mar 1, 2026

bench_voice_assistant.cpp carried a Stubs: 1 annotation and had no isolated benchmarks for STT or TTS performance. It only measured the full pipeline (processVoiceCommand), which bundles STT + LLM + TTS together, so a latency regression could not be attributed to a single stage.

Changes

benchmarks/bench_voice_assistant.cpp

STT Latency Benchmarks — exercise STTProcessor::transcribe() directly:

  • BM_STTLatency_Short: 1 s fixed baseline
  • BM_STTLatency_ByDuration: 0.5 / 1 / 5 / 30 / 60 s inputs (latency-vs-duration scaling)
  • BM_STTLatency_WithDiarization: speaker diarization path (1 / 5 / 30 s)
  • BM_STTLatency_Streaming: streamTranscribe() callback path (1 / 5 / 30 s)
  • BM_STTStatistics: getStatistics() overhead
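
As a rough illustration of the latency-vs-duration sweep, the sketch below times a callable on inputs of increasing length using only the standard library. It is not the code from this PR: `TranscribeFn`, `LatencySample`, and `sweepLatency` are hypothetical names, the callable stands in for STTProcessor::transcribe() (whose signature is not shown here), and the 16 kHz mono 16-bit input size is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the call under test. The real benchmark calls
// STTProcessor::transcribe(), whose signature is not shown in this PR.
using TranscribeFn = std::function<void(const std::vector<std::uint8_t>&)>;

struct LatencySample {
    int duration_ms;    // length of the synthetic audio input
    double latency_us;  // mean wall-clock time per call, in microseconds
};

// Time `fn` once per duration and report the mean latency per call.
// Assumption: 16 kHz mono 16-bit PCM, i.e. 32 bytes per millisecond,
// used only to size the dummy (all-zero) input buffers.
inline std::vector<LatencySample> sweepLatency(const TranscribeFn& fn,
                                               const std::vector<int>& durations_ms,
                                               int iterations) {
    assert(iterations > 0);
    std::vector<LatencySample> out;
    for (int ms : durations_ms) {
        std::vector<std::uint8_t> audio(static_cast<std::size_t>(ms) * 32, 0);
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) fn(audio);
        const auto stop = std::chrono::steady_clock::now();
        const double total_us =
            std::chrono::duration<double, std::micro>(stop - start).count();
        out.push_back({ms, total_us / iterations});
    }
    return out;
}
```

Calling `sweepLatency(fn, {500, 1000, 5000, 30000, 60000}, 10)` would give one mean latency per input length, which is the shape of data a per-duration benchmark reports.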

TTS Generation Speed Benchmarks — exercise TTSProcessor::synthesize() directly:

  • BM_TTSGenSpeed_Short: short phrase baseline
  • BM_TTSGenSpeed_ByLength: 50 / 200 / 500 / 2000 char texts
  • BM_TTSGenSpeed_WithOptions: speed variants 0.5× / 1.0× / 1.5× / 2.0×
  • BM_TTSGenSpeed_Streaming: streamSynthesize() callback throughput
  • BM_TTSAvailableVoices / BM_TTSStatistics: auxiliary overhead
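
The by-length variants need input texts of an exact character count. One possible way to build them deterministically is sketched below; `makeTextOfLength` is a hypothetical helper for illustration and is not taken from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: build a deterministic text of exactly `chars`
// characters by repeating a fixed seed sentence and truncating the last copy.
inline std::string makeTextOfLength(std::size_t chars) {
    static const std::string kSeed =
        "The quick brown fox jumps over the lazy dog. ";
    std::string text;
    text.reserve(chars);
    while (text.size() < chars) {
        // append(str, pos, count) clamps count to the end of `str`.
        text.append(kSeed, 0, chars - text.size());
    }
    assert(text.size() == chars);
    return text;
}
```

With this, the 50 / 200 / 500 / 2000 character inputs would come from `makeTextOfLength(50)` and so on, so every run synthesizes identical text.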

Helpers added:

  • buildWavBlob(duration_ms) — generates a standards-compliant 16-bit PCM WAV of arbitrary duration (noise, deterministic seed) for use as STT input without requiring a real audio file
  • getSTTProcessor() / getTTSProcessor() — lazy-initialised singletons; initialization cost paid once per binary run, not per benchmark iteration
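
The two helpers above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from this PR: the 16 kHz mono format, the seed value 42, and the noise amplitude are assumptions, and `FakeProcessor` / `getProcessor` are hypothetical stand-ins for the real processor types and their accessors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Append `bytes` low-order bytes of `value` in little-endian order.
inline void putLE(std::vector<std::uint8_t>& out, std::uint32_t value, int bytes) {
    for (int i = 0; i < bytes; ++i)
        out.push_back(static_cast<std::uint8_t>((value >> (8 * i)) & 0xFF));
}

// Append a four-character chunk tag such as "RIFF".
inline void putTag(std::vector<std::uint8_t>& out, const char* tag) {
    out.insert(out.end(), tag, tag + 4);
}

// Sketch of a WAV generator: canonical 44-byte header plus 16-bit PCM noise.
// Assumptions: mono, 16 kHz default sample rate, fixed seed 42.
inline std::vector<std::uint8_t> buildWavBlob(std::uint32_t duration_ms,
                                              std::uint32_t sample_rate = 16000) {
    const std::uint32_t num_samples = static_cast<std::uint32_t>(
        std::uint64_t{sample_rate} * duration_ms / 1000);
    const std::uint32_t data_bytes = num_samples * 2;  // 2 bytes per sample

    std::vector<std::uint8_t> wav;
    wav.reserve(44 + data_bytes);
    putTag(wav, "RIFF"); putLE(wav, 36 + data_bytes, 4);  // RIFF chunk size
    putTag(wav, "WAVE");
    putTag(wav, "fmt "); putLE(wav, 16, 4);               // fmt chunk size
    putLE(wav, 1, 2);                                     // format 1 = PCM
    putLE(wav, 1, 2);                                     // channels = 1
    putLE(wav, sample_rate, 4);
    putLE(wav, sample_rate * 2, 4);                       // byte rate
    putLE(wav, 2, 2);                                     // block align
    putLE(wav, 16, 2);                                    // bits per sample
    putTag(wav, "data"); putLE(wav, data_bytes, 4);

    std::mt19937 rng(42);  // fixed seed, so every run produces the same bytes
    for (std::uint32_t i = 0; i < num_samples; ++i) {
        const int sample = static_cast<int>(rng() % 2001) - 1000;  // [-1000, 1000]
        putLE(wav, static_cast<std::uint32_t>(sample) & 0xFFFF, 2);
    }
    assert(wav.size() == 44 + static_cast<std::size_t>(data_bytes));
    return wav;
}

// Hypothetical stand-in for STTProcessor / TTSProcessor.
struct FakeProcessor {
    static int& initCount() { static int n = 0; return n; }
    FakeProcessor() { ++initCount(); }  // imagine an expensive model load here
};

// Lazy singleton: the function-local static is constructed on the first call
// only (thread-safe since C++11), so later calls pay no initialization cost.
inline FakeProcessor& getProcessor() {
    static FakeProcessor instance;
    return instance;
}
```

A function-local static keeps the expensive setup out of the timed loop without needing a separate fixture, which matches the "paid once per binary run" behaviour described above.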

Each benchmark emits items_processed, bytes_processed (where relevant), and domain counters (audio_duration_ms, text_chars, speed_x10) for CI regression tracking.
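
To make the counter names concrete, here is a minimal standard-library sketch of how such values might be derived from a run's inputs. The functions `sttCounters` and `ttsCounters` are hypothetical, and reading speed_x10 as the speed multiplier times ten (so 1.5× becomes 15) is an assumption based on the name. In a Google Benchmark binary these values would typically go through `state.SetItemsProcessed()`, `state.SetBytesProcessed()`, and `state.counters[...]` rather than a plain map.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <map>
#include <string>

using Counters = std::map<std::string, double>;

// Hypothetical: counters for one STT benchmark run.
inline Counters sttCounters(std::int64_t iterations, std::int64_t wav_bytes,
                            std::int64_t audio_duration_ms) {
    assert(iterations >= 0);
    return {
        {"items_processed", static_cast<double>(iterations)},
        {"bytes_processed", static_cast<double>(iterations * wav_bytes)},
        {"audio_duration_ms", static_cast<double>(audio_duration_ms)},
    };
}

// Hypothetical: counters for one TTS benchmark run.
// Assumption: speed_x10 stores the speed multiplier scaled by 10 and rounded,
// so that fractional speeds such as 0.5x appear as whole numbers.
inline Counters ttsCounters(std::int64_t iterations, std::int64_t text_chars,
                            double speed) {
    assert(iterations >= 0);
    return {
        {"items_processed", static_cast<double>(iterations)},
        {"text_chars", static_cast<double>(text_chars)},
        {"speed_x10", static_cast<double>(std::lround(speed * 10.0))},
    };
}
```

Emitting the input size next to the timing lets a CI job compare like with like, for example latency per second of audio rather than raw latency.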

File header updated: Stubs: 0, version 0.0.33, quality score 100.0.

src/voice/ROADMAP.md

Performance benchmark checklist item updated from [I] to [P].

Type of Change

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Other:

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed

📚 Research & Knowledge (if applicable)

  • Is this PR based on scientific paper(s) or best practices?
    • If YES: research files created in /docs/research/?
    • If YES: linked in the module README under "Wissenschaftliche Grundlagen" (Scientific Foundations)?
    • If YES: recorded in /docs/research/implementation_influence/?

Relevant sources:

  • Paper:
  • Best Practice:
  • Architecture Decision:

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Documentation updated (if needed)
  • No new warnings introduced
Original prompt

This section details the original issue you should resolve

<issue_title>[voice] Improve STT latency and TTS gen speed benchmarks</issue_title>
<issue_description># [voice] Improve STT latency and TTS gen speed benchmarks

Summary

  • Module: voice
  • Item: Performance benchmarks (STT latency, TTS generation speed)

Implementation Context

  • Roadmap section: Production Readiness Checklist
  • Enhancement hint: Line 14: Open Issues: TODOs: 0, Stubs: 1

Implementation Phases

Phase 1: Voice Pipeline & Session Management (Status: Completed)

  • VoiceAssistant central coordinator for all voice interaction
  • STT processing via Whisper AI (speaker diarization, timestamps)
  • LLM integration via EmbeddedLLM / LlamaWrapper (intent recognition, query generation, response generation)
  • TTS synthesis with audio format output
  • Voice command processing pipeline (audio → STT → LLM → TTS → audio)
  • Session state and conversation history management

Phase 2: Streaming STT & Wake-Word Detection (Status: In Progress)

  • Real-time streaming STT (word-by-word transcription as audio arrives)
  • Wake-word detection for hands-free activation
  • Multi-speaker diarization improvements

Phase 3: Voice Macros & Browser Streaming (Status: Planned)

  • Voice command macros (user-defined shortcuts to AQL queries)
  • Language detection and automatic locale switching
  • Noise suppression preprocessing (RNNoise integration)
  • WebSocket audio streaming endpoint for browser clients
  • Voice session playback and search in stored transcripts

Phase 4: Multi-Language TTS & Biometric Authentication (Status: Planned)

  • Multi-language TTS (German, French, Spanish voices)
  • Emotion / sentiment detection from voice tone
  • Voice biometric authentication (speaker verification)
  • Real-time meeting transcription with action-item extraction
  • Integration with telephony systems (SIP / WebRTC)

Related Work

  • No direct overlap identified in pre-check.
  • Relationship is for coordination only; do not expand this issue scope.

Scope Boundary

  • In scope: implement only this roadmap item: "Performance benchmarks (STT latency, TTS generation speed)"
  • Out of scope: any other roadmap checklist item, even from the same module/section
  • Related issues/PRs are references only; implementation remains isolated to this issue
  • If additional work is required, open/link a follow-up issue instead of extending this issue

Mandatory Delivery Workflow (ThemisDB Rules)

  • Phase 0: Existing code review before implementation
    • Identify existing files/symbols/interfaces and document reuse plan
    • Verify no duplicate implementation of existing functionality
    • Record affected files and integration points before coding
  • Scope gate (must pass before coding)
    • Confirm implementation target is only the single issue description
    • Confirm no additional roadmap items are implemented in this issue/PR
    • Any newly discovered work is split into separate linked issue(s)
  • Design and implementation rules reviewed from:
    • docs/analysis/IMPLEMENTATION_GUIDE.md
    • docs/analysis/GPU_ACCELERATION_ADDENDUM.md
    • docs/architecture/MODULAR_ARCHITECTURE_ROADMAP.md
    • docs/architecture/THEMIS_CORE_GUIDE.md
    • docs\en\features\voice_assistant_guide.md
    • docs\en\voice\VOICE_MODULE_GUIDE.md
  • Architecture and compatibility validation
    • Confirm behavior compatibility with existing APIs unless breaking change is explicitly declared
    • Confirm telemetry/logging/metrics integration follows existing module patterns
  • Code review gate before completion
    • Self-review against roadmap acceptance criteria
    • Cross-check for overlap/duplication with existing implementations
  • Validation gate
    • Unit + integration tests updated/added
    • Performance impact measured against baseline

Implementation Tasks

  • VoiceAssistant central coordinator for all voice interaction
  • STT processing via Whisper AI (speaker diarization, timestamps)
  • LLM integration via EmbeddedLLM / LlamaWrapper (intent recognition, query generation, response generation)
  • TTS synthesis with audio format output
  • Voice command processing pipeline (audio → STT → LLM → TTS → audio)
  • Session state and conversation history management
  • Real-time streaming STT (word-by-word transcription as audio arrives)
  • Wake-word detection for hands-free activation
  • Multi-speaker diarization improvements
  • Voice command macros (user-defined shortcuts to AQL queries)
  • Language detection and automatic locale switching
  • Noise suppression preprocessing (RNNoise integration)
  • WebSocket audio streaming endpoint for browser clients
  • Voice session playback and search in stored transcripts
  • Multi-language TTS (German, French, Spanish voices)
  • Emotion / sentiment detection from voice tone
  • Voice biometric authentication (speak...


Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Improve STT latency and TTS generation speed benchmarks" to "[voice] Add dedicated STT latency and TTS generation speed benchmarks" Mar 1, 2026
makr-code marked this pull request as ready for review March 1, 2026 18:22
makr-code merged commit 6c09d5e into develop Mar 1, 2026
11 checks passed
makr-code modified the milestones: v1.0.2, v1.5.0 Mar 11, 2026
Development

Successfully merging this pull request may close these issues.

[voice] Improve STT latency and TTS gen speed benchmarks
