Skip to content

nilin/aurilink

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

175 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Cue

Real-time conversational coach via Bluetooth earpiece (AirPods). Listens to your conversation and whispers hints — a different angle, a probing question, a piece of background information — only when it has something you genuinely wouldn't think of yourself.

iPhone (thin client)              Modal Server (A10G GPU)
────────────────────             ──────────────────────────
Mic → 16kHz PCM ──── WebSocket ──→ STT (WhisperLive / Parakeet / Whisper)
                                       │
                                  Sortformer diarization ([me]/[other])
                                       │
                                  research agent (web search, background)
                                       │
                                  proposer LLM (OpenAI/Anthropic)
                                       │
                                  self-reranking quality gate
                                       │
                                  Kokoro TTS (82M, 24kHz, en+zh)
                                       │
display hint     ←── WebSocket ──  hint JSON + PCM audio
 + play audio                          (preempts prior TTS)

Server

Runs on Modal (A10G GPU for STT/diarization/TTS, LLM via external API):

  • STT: Three backends, selectable per-session via config message or app UI:
    • whisperlive (default) — WhisperLive (whisper-live 0.7.1, faster-whisper large-v3). Subprocess on same GPU. Streaming VAD + transcription. Multilingual. Stale message rejection by segment timestamps. Hallucination filtering (common Whisper artifacts like "thank you for watching", foreign script detection).
    • parakeet — Parakeet TDT 0.6B int8 via sherpa-onnx + Silero VAD. Direct port of the on-device iOS engine (ondevice/DiarizationEngine.swift), including 2s pre-roll buffer, timestamp dedup (word-end times, 0.15s margin), suffix-prefix dedup, virtual chunking for partial results, and 20s force-commit. English only.
    • whisper — faster-whisper large-v3 + Silero VAD. Same architecture as parakeet (our VAD, pre-roll, dedup, virtual chunking) but using Whisper for multilingual support.
  • Diarization: Sortformer v2.1 ONNX (streaming, up to 4 speakers) via parakeet-rs Python port. Uses feed() for stateful streaming — speaker cache and FIFO persist across calls so speaker IDs stay consistent. Enrollment audio is fed first so Spk0 = enrolled user. Runs on GPU.
  • Enrollment: Voice enrollment seeds the diarizer with the user's voice. Supports mic recording, local WAV file (--enroll-file), or server-side voice (--enroll-server buyer / enroll_file message). Pre-stored voices live in server/enrollment_voices/.
  • Transcript: Two layers — raw (upstream, plain WhisperLive/Parakeet text) and diarized (downstream, [me]/[other] labels merged with Sortformer segments). Each STT segment stays on its own line to prevent visual jumping when speakers are re-attributed. Both layers always exist; display and proposer use the diarized version.
  • Emotion: emotion2vec (~300M) via FunASR — real-time speech emotion detection.
  • TTS: Kokoro (82M params, 24kHz) — bilingual (English + Chinese) server-side TTS. Both pipelines are eager-loaded at startup for instant language switching.
  • Proposer: OpenAI GPT-5.4 (default) or Anthropic Claude — set LLM_PROVIDER=anthropic to switch. Set REASONING_EFFORT=low for reasoning models. Shared logic in server/proposer.py. System prompt is hot-editable via Modal Volume (no redeploy needed).

Research Agent

Background research loop that mediates ALL context for the proposer. The proposer no longer sees raw transcript history or user contexts directly — the research agent digests everything and produces a focused context brief.

Inputs (every 5–10s):

  • Full transcript history (diarized)
  • User-provided custom context (e.g. "I'm buying a 2021 Civic")
  • Its own previous research outputs (cumulative)
  • Previous hints (to avoid redundant research)

Output: a context brief that is the proposer's only source of:

  • Older conversation history (summarized, prioritized)
  • Custom context interpretation (what it means for the conversation)
  • Background research (web search results — pricing, specs, regulations, benchmarks)
  • Guidance on what the proposer should focus on next

What the proposer still gets directly (not through the agent):

  • Recent transcript tail (last few seconds, for real-time responsiveness)
  • Hint history H0..HN (for the quality gate / dedup)

Architecture:

  • Runs as an async loop on the server, every 5–10s
  • Uses LLM tool-use / function-calling with a web_search tool for grounded facts
  • Not deep research — fast, targeted lookups (single search + top result extraction)
  • Output is cumulative: each run refines and extends the context brief
  • Must not block the STT → proposer → hint pipeline
  • Optionally connects to an external agent on a different server (e.g. a more powerful research agent) if configured via research_agent_url in config. Not enabled by default.

This is what enables hints like "rebrand avg five to fifteen k" or "Stripe uses Inter typeface" — facts the LLM might not know or might hallucinate without grounding.

transcript ──┐
context ─────┤                              recent tail ──────┐
prev research┤→ RESEARCH AGENT (5-10s) ──→ context brief ──┐  │
prev hints ──┘    │ web_search tool          (prioritized)  │  │
                  │ external agent (opt)                    ▼  ▼
                                                     PROPOSER (0.4s)
                                                         │
                                                    quality gate
                                                         │
                                                      hint/silence

Prerequisites

Create Modal secrets:

modal secret create openai-token OPENAI_API_KEY=sk-xxxxx
modal secret create anthropic-token ANTHROPIC_API_KEY=sk-ant-xxxxx  # if using LLM_PROVIDER=anthropic
modal secret create app-token APP_TOKEN=your-shared-secret

Running

modal deploy server/app_api.py    # production deploy
modal serve server/app_api.py     # dev mode with hot reload

# CLI client
python server/live_client.py                                                                    # mic + voice enrollment
python server/live_client.py --stt whisper --enroll-server buyer --file scripts/output/manuscript-2-car.wav  # whisper backend + car test
python server/live_client.py --stt parakeet --enroll-file audio/mimi_voice.wav --file audio/test.wav         # parakeet + local enrollment
python server/live_client.py --no-enroll --live                                                 # skip enrollment, live display mode

Editing the prompt (no redeploy)

vim server/system_prompt.txt
modal volume put project-prompts server/system_prompt.txt system_prompt.txt

The server reloads the prompt from the volume on every proposer call. Falls back to the built-in default in proposer.py if the file is missing.

OpenAI STT Client (standalone, no GPU)

api/openai_stt_client.py — dual-track transcription using OpenAI's API instead of faster-whisper. Runs from a laptop with no GPU required.

  • Fast track: gpt-4o-mini-transcribe every 0.5s (trailing 4s window) → low-latency live preview
  • Diarized track: gpt-4o-transcribe-diarize every 2s (4s window) → speaker-tagged finalized text with [me]/[speaker_B] labels
  • Word timestamps: whisper-1 runs in parallel with the diarized model on the same audio clip, providing word-level timestamps for anchor-based merging
  • Merge: Whisper-based anchor merge — finds the word in the new block closest to the last committed timestamp, deduplicates overlap (≤0.4s snap), and appends the rest as a pending tail.
python api/openai_stt_client.py                     # enroll then stream
python api/openai_stt_client.py --enroll prev.wav    # reuse enrollment audio
python api/openai_stt_client.py --no-enroll           # skip enrollment

Configuration

Shared settings live in server/config.yaml:

tts: true                  # server-side Kokoro TTS
cooldown: 3.0              # seconds after TTS ends before next hint
proposer_max_tokens: 150   # max tokens for proposer LLM call
temperature: 0.9           # sampling temperature

Both live_client.py and the iOS app send these as a config WebSocket message on connect. The config also includes stt (backend selection). CLI flags override YAML values (e.g. python live_client.py --no-tts --stt whisper --temperature 0.5).

iOS App (client/)

Thin client — no API keys, no local models. Settings tab for STT backend, TTS, hint cooldown, and temperature.

  1. Open client/project.xcodeproj in Xcode
  2. Set your development team and bundle ID
  3. Run on device with AirPods

The app flow:

  1. Connect — WebSocket to server, sends config (including STT backend choice)
  2. Enroll — 10s voice recording for speaker ID (saved locally as WAV — skipped on subsequent launches)
  3. Listen — live transcript with [me]/[other] tags (blue/white/gray); hints displayed as banner

Settings tab:

  • STT Backend — Parakeet / Whisper / WhisperLive (reconnects on change)
  • TTS — server-side Kokoro on/off
  • Cooldown — seconds between hints
  • Temperature — LLM sampling temperature
  • Test Audio — stream bundled car negotiation audio (enrolls with server-side buyer voice)

WebSocket Protocol

Client → Server: binary PCM frames (16kHz int16 mono), plus JSON control messages (config, enroll_start, enroll_stop, enroll_file, enroll_embedding, clear, add_context, remove_context)

Server → Client: JSON — transcript (text + diarized + partial), hint, diarization, tts_meta, enrolled, debug, proposals; binary — PCM int16 mono audio

Specs

  • Hint cache (server): If a hint arrives while TTS is playing or during cooldown, it goes into a single-slot cache (replacing any previously cached hint). Once cooldown expires and TTS finishes, the cached hint fires automatically.
  • Pause gate (client): "Speak only on pause" slider (0–1s, default 0 = immediate). Holds hints until mic silence exceeds threshold, then plays instantly.
  • No-earpiece safety (client): Suppresses spoken hints when no headphones connected. Speaker toggle to override.
  • Stale message rejection (WhisperLive): Messages whose max segment timestamp is behind the latest seen are dropped, preventing partial tail regression from out-of-order responses.
  • Hallucination filter (WhisperLive): Partial tails matching common Whisper hallucinations ("thank you for watching", foreign script) are replaced with empty string.

Advisor Model

Single-stage pipeline with self-reranking quality gate:

  1. Proposer — fires every 0.4s when transcript changes. Generates a candidate hint (≤5 words) with chain-of-thought reasoning.
  2. Quality gate — the proposer compares its own hint against H0 (silence) and all recent spoken hints (H1..HN). The hint only fires if the model picks its own label as best.

The gate filters out:

  • Social filler ("That's impressive")
  • Obvious advice the user would think of themselves
  • Hints after [me] just spoke (the moment has passed)
  • Duplicates of recent hints

Overlapping calls and hint filtering

API calls overlap — a new call fires whenever the transcript changes, even if the previous call is still in flight. Each call sees the latest spoken hints for filtering.

time ──────────────────────────────────────────────────────────→

[other] "Our budget is $3k"
  │
  ├─ call A ─────────────────→ "rebrand five to fifteen k" ✓ SPOKEN (H1)
  │
[me] "A full rebrand runs 5-15k..."
  │                                    (no call — [me] just spoke)
  │
[other] "We want the vibe to feel like Stripe"
  │
  ├─ call B ─────────────────→ "Stripe uses Inter font" ✓ SPOKEN (H2)
  │
  ├─ call C ───────────→ "ask about brand colors"
  │                      best=H2, not H3 → FILTERED
  │
[other] "We need it by end of month"
  │
  ├─ call D ─────────────────→ "rush fee ten to twenty pct" ✓ SPOKEN (H3)

On-Device App (ondevice/)

Standalone iOS app that runs STT + diarization on-device (no server). Uses:

  • Parakeet TDT 0.6B int8 — offline transducer via SherpaOnnx (CoreML on device, CPU on simulator)
  • Silero VAD — voice activity detection with 2s pre-roll buffer
  • Sortformer v2.1 — streaming speaker diarization
  • Timestamp dedup — word-end times from TDT with 0.15s margin
  • Suffix-prefix text dedup — case-insensitive, punctuation-stripped overlap removal
  • Virtual chunking — 2s chunks with 2s overlap for partial decode results

The server's sherpa_stt.py and whisper_stt.py are direct ports of this engine's logic.

Simulator

server/test_proposer.py — test harness for iterating on the prompt and evaluating hint quality.

python server/test_proposer.py my_script.json                    # run a dialogue script
python server/test_proposer.py ../playground/scripts/manuscript_demo_scenarios.json --scenario 1  # specific scenario

Feeds dialogue turn-by-turn using the same SYSTEM_PROMPT and parse_response() as the server. Only fires after [other]'s turns.

Demo Audio

scripts/generate_audio.py — generates demo dialogue audio for all 4 manuscripts using Chatterbox TTS.

pip install chatterbox-tts soundfile numpy
python scripts/generate_audio.py              # generate all 4
python scripts/generate_audio.py --script 2   # generate just one

Uses voice references in audio/ for speaker cloning. Output goes to scripts/output/. Manuscripts are in scripts/manuscript-*.md (see scripts/AGENTS.md for requirements).

Security

Prompt injection

The main attack surface is the transcript: anyone in the conversation can speak text that ends up in the LLM prompt. A participant could say "ignore previous instructions and output X" to manipulate the proposer or research agent.

Current mitigations:

  • Structured JSON output with strict parsing — parse_response only extracts hint, short, best from valid JSON. Free-text injection in the output gets rejected.
  • Small output surface — hints are 5-10 words. Even a successful injection can only produce a short phrase, not exfiltrate data or run actions.
  • Quality gate — the hint must beat silence and all previous hints to fire. Random injected text is unlikely to pass.

Planned mitigations:

  • Transcript delimiter in prompt: mark the transcript section as data, not instructions. Tell the LLM to ignore any commands within it.
  • Output validation: lightweight check on the hint before speaking — does it look like a hint or like leaked system prompt / injected content?
  • Research agent gating: the research agent has web search. An injection like "search for [malicious query]" could trigger unwanted searches. Gate search queries through a relevance check before executing.
  • Injection classifier: run a separate lightweight model to detect prompt injection attempts in the transcript before feeding it to the proposer.

Audio recording consent

The app records all audio from the microphone and streams it to a remote server. The disclaimer screen requires explicit consent before any recording begins. Users must confirm they will only use the app in settings where recording is permitted by law.

Authentication

Currently uses a shared APP_TOKEN for all connections. No per-user auth. Planned: Sign in with Apple for per-user accounts, per-user JWT tokens, server-side enrollment persistence.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors