Summary
Analyze vocal signals (speaking pace, volume, pause frequency) to adapt the avatar's communication style in real time: speaking more slowly and reassuringly when the user sounds uncertain, matching energy when they're excited, offering to pause when they sound overwhelmed. The avatar responds to emotional context through behavioral adaptation, not by labeling emotions.
Market Signal
HeyGen Avatar IV (2026) introduced a diffusion-based audio-to-expression engine that maps vocal tone, rhythm, and emotion to facial movements with natural lip-sync and micro-expressions across 175+ languages — proving that audio-to-emotional-response is technically mature. D-ID V4 Expressive Visual Agents (March 2026) ship sentiment-aligned facial expressions in real-time conversations with sub-500ms latency. ElevenLabs Eleven v3 adds emotion and delivery control to TTS output, enabling the responding avatar to modulate its own voice emotionally. The industry is moving beyond functional interaction toward emotionally intelligent conversation.
User Signal
TalkTerm's PRD defines the avatar as a 'companion, not tool' (Experience Principle 1) and identifies 'Trust & Safety' as a primary emotional goal. The desired emotional response includes 'Effortless Progress' and 'Wonder & Excitement.' But the current avatar states (listening, thinking, speaking per FR3) are purely functional — they communicate what the agent is doing, not how it relates to the user's emotional state. For non-technical users who may feel anxiety about delegating consequential actions to AI, an avatar that adapts its communication style to their emotional context could meaningfully accelerate trust formation.
Technical Opportunity
The Web Audio API (available in Electron's Chromium renderer) provides real-time audio stream analysis before STT processing. Basic vocal signal analysis — amplitude for volume, timing between words for pace, silence duration for pauses — is computationally trivial (FFT + simple statistics, no ML model needed for MVP). The Rive state machine already supports parameterized inputs (setInputState) — adding continuous expression parameters (e.g., expressionIntensity: 0.0-1.0, communicationPace: 'calm'|'normal'|'energetic') requires additional animation states but no architecture changes. The TTS abstraction layer can pass emotion/pace hints to providers like ElevenLabs that support delivery control.
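As a rough illustration of how little computation the volume and pause signals need, here is a minimal sketch. It assumes frames of PCM samples in the range [-1, 1], such as those an `AnalyserNode` fills via `getFloatTimeDomainData`. The `VocalSignals` shape, the function names, and the silence threshold are hypothetical and would need tuning against real microphone input.

```typescript
// Hypothetical shape for per-window measurements; not part of TalkTerm.
interface VocalSignals {
  meanRms: number;      // average frame RMS amplitude (0..1 for samples in [-1, 1])
  silenceRatio: number; // fraction of frames whose RMS falls below the threshold
}

// Root-mean-square amplitude of one frame of PCM samples.
function frameRms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    sumSquares += frame[i] * frame[i];
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Summarise a window of frames. The default threshold is an assumed
// tuning value, not a measured one.
function analyzeFrames(
  frames: Float32Array[],
  silenceThreshold = 0.02
): VocalSignals {
  // An empty window is treated as entirely silent.
  if (frames.length === 0) return { meanRms: 0, silenceRatio: 1 };
  let total = 0;
  let silent = 0;
  for (const frame of frames) {
    const level = frameRms(frame);
    total += level;
    if (level < silenceThreshold) silent++;
  }
  return {
    meanRms: total / frames.length,
    silenceRatio: silent / frames.length,
  };
}
```

Keeping these functions free of any Web Audio objects means they can be unit-tested outside the renderer. Speaking pace is not covered here, since a words-per-minute figure would most likely need word counts from the STT transcript.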
Assessment
| Dimension | Score | Rationale |
| --- | --- | --- |
| Feasibility | med | Audio signal analysis is simple, but designing effective Rive animation variants for emotional expression requires significant animation design work. TTS emotion control depends on cloud provider integration (Phase 2). |
| Impact | high | Directly serves the 'companion, not tool' experience principle. Could be TalkTerm's most memorable differentiator: the avatar that *gets* you. |
| Urgency | med | Phase 2 feature that requires a working avatar and voice pipeline (Epic 3) first. But the Rive state machine design (Story 3.1) should anticipate expression parameters from day one. |
Adversarial Review
Strongest objection: Detecting emotion from audio is unreliable. Different cultures express emotions differently. False positives could be patronizing — an avatar that slows down when the user is simply thinking carefully, or becomes energetic when the user is just speaking loudly in a noisy room.
Rebuttal: The proposal explicitly avoids emotion classification (no 'you seem frustrated' moments). It uses three mechanically reliable vocal signals: speaking pace (words/minute), volume (amplitude), and pause frequency (silence ratio). These drive subtle communication style shifts: the avatar mirrors the user's pace, adjusts its own speaking speed, and modulates option density (fewer choices when pauses suggest cognitive load). Because the shifts are subtle by design, a wrong adaptation is unlikely to be noticed. Start with pace-matching only (mirror the user's speaking speed in TTS output), the simplest and most reliably beneficial behavior, and expand from there based on user testing.
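The pace-matching starting point could look something like the sketch below. It assumes the word count comes from the STT transcript and that the TTS provider accepts a speed multiplier around 1.0. Every constant and function name here is an illustrative assumption, not a value from the PRD or from any provider's API.

```typescript
// Illustrative constants; all are assumptions to be tuned in user testing.
const BASELINE_WPM = 150; // assumed "normal" conversational pace
const MIN_RATE = 0.85;    // a narrow clamp keeps the shifts subtle
const MAX_RATE = 1.15;
const SMOOTHING = 0.3;    // weight given to the newest observation (0..1)

function clampRate(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Words per minute from a word count and the utterance duration.
function wordsPerMinute(wordCount: number, durationMs: number): number {
  return durationMs > 0 ? (wordCount * 60000) / durationMs : 0;
}

// Move the current TTS rate a fraction of the way toward the user's pace,
// so a single unusual utterance cannot cause an abrupt change.
function nextTtsRate(currentRate: number, userWpm: number): number {
  if (userWpm <= 0) return currentRate; // no speech observed; keep the rate
  const target = clampRate(userWpm / BASELINE_WPM, MIN_RATE, MAX_RATE);
  return currentRate + SMOOTHING * (target - currentRate);
}
```

The clamp and the smoothing factor are what keep a misread from being noticeable: even a user speaking at twice the baseline pace moves the rate only a few percent per utterance.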
Suggested Next Step
Prototype vocal signal detection (pace, volume, pauses) using Web Audio API in an Electron test harness. Define 3-4 avatar communication style presets (calm/measured, normal, energetic, gentle/reassuring) and map them to Rive state machine parameters. Test with 5 non-technical users for perceived naturalness vs. perceived patronizing behavior. Results inform whether to proceed or pivot the approach.
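For the prototype, the presets and their mapping to animation parameters might be sketched as follows. Rive state machine inputs are numeric, boolean, or trigger, so this sketch assumes `communicationPace` is encoded as a number rather than a string. The preset values, the `choosePreset` thresholds, and the `ObservedSignals` shape are all hypothetical placeholders meant to be replaced by what user testing shows.

```typescript
type StylePreset = "calm" | "normal" | "energetic" | "gentle";

// Parameter names follow the proposal; the numeric encoding is an assumption.
interface StyleParams {
  expressionIntensity: number; // 0.0 to 1.0
  communicationPace: number;   // assumed encoding: 0 calm, 1 normal, 2 energetic
}

// Placeholder values for the four presets named in the proposal.
const PRESETS: Record<StylePreset, StyleParams> = {
  calm:      { expressionIntensity: 0.3, communicationPace: 0 },
  normal:    { expressionIntensity: 0.5, communicationPace: 1 },
  energetic: { expressionIntensity: 0.8, communicationPace: 2 },
  gentle:    { expressionIntensity: 0.4, communicationPace: 0 },
};

// Hypothetical summary of the three vocal signals for one utterance.
interface ObservedSignals {
  wpm: number;          // speaking pace in words per minute
  meanRms: number;      // average amplitude, 0..1
  silenceRatio: number; // fraction of the utterance spent in pauses, 0..1
}

// Rule-based preset selection. The thresholds are untested guesses.
function choosePreset(signals: ObservedSignals): StylePreset {
  if (signals.silenceRatio > 0.5) return "gentle"; // many pauses
  if (signals.wpm < 110) return "calm";            // slow, steady speech
  if (signals.wpm > 180 && signals.meanRms > 0.2) return "energetic";
  return "normal";
}
```

Requiring both fast pace and high volume before selecting the energetic preset is one way to reduce the noisy-room false positive raised in the adversarial review, since loudness alone never triggers it.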