v0.17.1

@perrette perrette tagged this 21 May 19:52
Pseudo-streaming:
  - Hysteresis silence detection: LOW (-40 dB) inside an utterance,
    HIGH (-25 dB) to start one. Keeps ambient noise from opening
    spurious chunks and stops trailing syllables from being clipped
    at chunk ends.
  - Commit on every detected silence pause; session audio queue is
    preserved across cuts.
  - 1.5 s minimum chunk duration suppresses Whisper hallucinations
    on sub-second clips ("Thanks for watching!" etc.).
  - Cross-chunk prompt context: each chunk's transcription feeds the
    next chunk's initial_prompt for capitalization / article gender /
    language stability. Dropped after pauses >1.5 s so a bad chunk
    can't poison the rest of the recording.
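
The hysteresis rule above can be illustrated with a minimal sketch. This is
not the project's code: the function names, the per-frame dB input, and the
frame-by-frame decision are assumptions made for illustration, and the sketch
does not model pause length or the 1.5 s minimum chunk duration. Only the two
thresholds (-25 dB to start, -40 dB to stay inside) come from the notes.

```python
import math

START_DB = -25.0  # HIGH threshold: level needed to open an utterance
KEEP_DB = -40.0   # LOW threshold: level needed to stay inside one


def rms_db(frame):
    """RMS level in dBFS of a frame of float samples in [-1, 1]."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def speech_flags(levels_db, start_db=START_DB, keep_db=KEEP_DB):
    """Return one bool per frame: True while inside an utterance."""
    in_speech = False
    flags = []
    for level in levels_db:
        # Opening needs the HIGH threshold; once open, only the LOW one.
        threshold = keep_db if in_speech else start_db
        in_speech = level >= threshold
        flags.append(in_speech)
    return flags
```

With frame levels of -50, -30, -20, -35, -45 and -30 dB, the -30 dB frames
never open an utterance on their own, while the -35 dB frame that follows
real speech is kept as part of it.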

whisper-futo:
  - Text-layer non-speech filter ("(music)", "[Applause]", etc).
  - max_tokens cap bounds decoder repetition loops on short chunks.
  - initial_prompt and --words now wired through.
  - Reverted the q8_0 turbo model (large-v3 encoder incompatible
    with ACFT audio_ctx shrinkage).
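
A text-layer filter of the kind described in the first bullet might look
like the sketch below. The regex and the function name are hypothetical; the
actual filter may instead match a fixed list of known annotations, and this
version also removes any legitimate parenthetical text.

```python
import re

# Whole bracketed or parenthesized spans such as "[Applause]" or "(music)".
_ANNOTATION = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def strip_non_speech(text):
    """Remove bracketed annotations and collapse leftover whitespace."""
    cleaned = _ANNOTATION.sub(" ", text)
    return " ".join(cleaned.split())
```

A chunk that consists only of an annotation becomes an empty string, which
the caller can then drop instead of typing it out.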

OpenAI realtime (gpt-realtime-whisper):
  - Stop sending the prompt field: the model rejects it
    server-side. The kwarg is still accepted for plumbing
    compatibility.
  - Coalesce per-token deltas before yielding so paste_via_clipboard
    on Wayland doesn't race itself. Flush on a ~400 ms cadence or at
    sentence-final punctuation, with a 200 ms floor between any two
    flushes.
  - Bypass coalescing entirely under --type-direct (no clipboard, no
    race to defeat).
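
The coalescing policy can be sketched as a small buffer. The class and method
names are hypothetical and the timing logic is one possible reading of the
bullets above, not the project's implementation. Timestamps are passed in
explicitly (for example from time.monotonic()) so the behaviour is easy to
test.

```python
class DeltaCoalescer:
    """Buffer per-token deltas and emit them in larger pieces."""

    def __init__(self, cadence=0.4, floor=0.2):
        self.cadence = cadence  # flush once buffered text is this old (s)
        self.floor = floor      # minimum gap between two flushes (s)
        self._buf = []
        self._buf_start = 0.0
        self._last_flush = float("-inf")

    def feed(self, delta, now):
        """Add a delta; return text to emit, or None to keep buffering."""
        if not self._buf:
            self._buf_start = now
        self._buf.append(delta)
        if now - self._last_flush < self.floor:
            return None  # too soon after the previous flush
        text = "".join(self._buf)
        due = now - self._buf_start >= self.cadence
        sentence_end = text.rstrip().endswith((".", "!", "?"))
        if due or sentence_end:
            self._buf.clear()
            self._last_flush = now
            return text
        return None

    def flush(self):
        """Return whatever is still buffered (call at end of stream)."""
        text = "".join(self._buf)
        self._buf.clear()
        return text
```

Under --type-direct the caller would skip this class and yield each delta as
it arrives, since there is no clipboard paste to serialize.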

Session loop:
  - Don't drop queued audio after a silence-cut: slow backends were
    losing the user's next words during the round-trip.

Documentation:
  - docs/backends.md vocabulary table now lists whisper-futo and
    correctly describes realtime's prompt handling.
  - docs/backends.md pseudo-streaming section documents the
    cross-chunk context behaviour and pause-based reset.
  - docs/keyboard.md final note explains realtime delta coalescing
    and why --type-direct bypasses it.