Extensible voice conversation SDK for building ChatGPT/Claude-style voice modes in any application. Provider-agnostic STT (Speech-to-Text) and TTS (Text-to-Speech) with a pluggable architecture.
Start with AWS, extend to Azure, ElevenLabs, Google Cloud, or any custom provider.
- Provider Abstraction — Swap between AWS Transcribe, AWS Polly, browser-native, or custom providers with a single config change (see the sketch after this list)
- Voice Pipeline — Full VAD → STT → LLM → TTS → Playback orchestration with barge-in support
- React Components — Ready-to-use VoiceOverlay (phone-screen style), MicrophoneButton, and AudioVisualizer
- Express Middleware — Server routes for proxying STT/TTS (keeps credentials server-side)
- Sentence-Level Streaming TTS — Plays audio sentence-by-sentence during LLM streaming for minimal latency
- Zero Idle Cost — AWS services are purely pay-per-use ($0 when not in use)
- Lightweight — Only loads provider SDKs you actually use (peer dependencies)
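For example, switching from the paid AWS providers to the free browser-native fallbacks is a one-line change per provider. A minimal sketch, assuming the fallbacks are registered under the `'browser'` key (check `factory.ts` for the exact provider keys and config shapes):

```ts
import { createSTTProvider, createTTSProvider } from '@illuma-ai/voice';

// Paid, higher-quality cloud providers:
const stt = createSTTProvider('aws-transcribe', { region: 'us-east-1' });
const tts = createTTSProvider('aws-polly', { region: 'us-east-1' });

// Free browser-native fallbacks behind the same interfaces
// ('browser' key and empty config are assumptions):
const sttFree = createSTTProvider('browser', {});
const ttsFree = createTTSProvider('browser', {});
```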
```bash
npm install @illuma-ai/voice
```

Peer dependencies are optional; install only the providers and integrations you use:

```bash
# AWS providers (recommended)
npm install @aws-sdk/client-polly @aws-sdk/client-transcribe-streaming

# React components
npm install react react-dom

# Express server middleware
npm install express
```

```ts
import {
  createSTTProvider,
  createTTSProvider,
  createVoicePipeline,
} from '@illuma-ai/voice';

// Create providers
const stt = createSTTProvider('aws-transcribe', { region: 'us-east-1' });
const tts = createTTSProvider('aws-polly', { region: 'us-east-1' });

// Create voice pipeline
const pipeline = createVoicePipeline({
  stt,
  tts,
  sttConfig: { languageCode: 'en-US' },
  ttsConfig: { voiceId: 'Joanna' },
  onSubmit: async (text) => {
    // Submit to your LLM and return a streaming response
    const response = await fetch('/api/chat', {
      method: 'POST',
      body: JSON.stringify({ message: text }),
    });
    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    return {
      stream: {
        async *[Symbol.asyncIterator]() {
          while (true) {
            const { done, value } = await reader.read();
            if (done) break;
            yield decoder.decode(value);
          }
        },
      },
      abort: () => reader.cancel(),
    };
  },
});

// Start listening
await pipeline.start();

// Stop
pipeline.stop();
```
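The `onSubmit` handler above assumes an `/api/chat` endpoint that streams plain-text chunks. A minimal sketch of such an endpoint (hypothetical; replace the canned chunks with your LLM's streaming output):

```ts
import express from 'express';

const app = express();
app.use(express.json());

// Streams plain-text chunks so the pipeline can speak sentence-by-sentence.
app.post('/api/chat', async (_req, res) => {
  res.setHeader('Content-Type', 'text/plain; charset=utf-8');
  for (const chunk of ['Hello! ', 'How can I help you today?']) {
    res.write(chunk); // stand-in for real LLM streaming output
  }
  res.end();
});

app.listen(3000);
```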
```tsx
import {
  useVoiceMode,
  VoiceOverlay,
  MicrophoneButton,
} from '@illuma-ai/voice/react';
import { createSTTProvider, createTTSProvider } from '@illuma-ai/voice';
function ChatInput() {
  const voice = useVoiceMode({
    stt: createSTTProvider('aws-transcribe', { region: 'us-east-1' }),
    tts: createTTSProvider('aws-polly', { region: 'us-east-1' }),
    sttConfig: { languageCode: 'en-US' },
    ttsConfig: { voiceId: 'Joanna' },
    onSubmit: submitToLLM, // your LLM glue code; sketched below
  });

  return (
    <div>
      <textarea />
      <MicrophoneButton
        isActive={voice.isOpen}
        onClick={voice.open}
      />
      <VoiceOverlay {...voice} />
    </div>
  );
}
```
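`submitToLLM` above is your own glue code. It must return the same `{ stream, abort }` shape used in the pipeline quick start, for example:

```ts
// Hypothetical submitToLLM: forwards text to /api/chat and streams the reply.
async function submitToLLM(text: string) {
  const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message: text }),
  });
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  return {
    stream: {
      async *[Symbol.asyncIterator]() {
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          yield decoder.decode(value);
        }
      },
    },
    abort: () => reader.cancel(),
  };
}
```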
```ts
import express from 'express';
import { createVoiceRouter } from '@illuma-ai/voice/server';
import { createSTTProvider, createTTSProvider } from '@illuma-ai/voice';
const app = express();
app.use(express.json());

app.use(
  '/api/voice',
  createVoiceRouter({
    stt: createSTTProvider('aws-transcribe', {
      region: process.env.AWS_REGION!,
    }),
    tts: createTTSProvider('aws-polly', {
      region: process.env.AWS_REGION!,
    }),
    defaultTTSConfig: {
      voiceId: 'Joanna',
      outputFormat: 'mp3',
    },
  }),
);
app.listen(3000);
```
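Once mounted, browser code can call the proxied routes without ever holding AWS credentials. A sketch of two low-risk calls; the response shapes are assumptions, so check the route handlers for the actual contract:

```ts
// GET routes exposed by the router (see routes.ts in the layout below).
const voices = await fetch('/api/voice/voices').then((r) => r.json());
const health = await fetch('/api/voice/health').then((r) => r.json());
console.log(voices, health);
```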
```text
@illuma-ai/voice
├── providers/                 # Provider abstraction layer
│   ├── types.ts               # ISTTProvider, ITTSProvider interfaces
│   ├── stt/
│   │   ├── aws-transcribe.ts  # AWS Transcribe Streaming
│   │   └── browser.ts         # Web Speech API (free fallback)
│   ├── tts/
│   │   ├── aws-polly.ts       # AWS Polly Neural
│   │   └── browser.ts         # SpeechSynthesis (free fallback)
│   └── factory.ts             # createSTTProvider(), createTTSProvider()
│
├── pipeline/                  # Voice conversation engine
│   ├── voice-pipeline.ts      # Full STT → LLM → TTS orchestrator
│   ├── vad.ts                 # Voice Activity Detection
│   ├── audio-capture.ts       # Microphone access + recording
│   ├── audio-player.ts        # Queue-based playback + barge-in
│   └── sentence-splitter.ts   # Stream text → sentences for TTS
│
├── server/                    # Express middleware
│   └── routes.ts              # /transcribe, /synthesize, /voices, /health
│
└── client/                    # React hooks + components
    ├── hooks/
    │   ├── useVoiceMode.ts    # Full voice conversation state
    │   ├── useSTT.ts          # Standalone STT hook
    │   ├── useTTS.ts          # Standalone TTS hook
    │   └── useVAD.ts          # Voice activity detection
    └── components/
        ├── VoiceOverlay.tsx      # Full-screen voice UI
        ├── MicrophoneButton.tsx  # Mic icon for chat input
        └── AudioVisualizer.tsx   # Animated orb visualization
```
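For intuition, the sentence-level streaming handled by `pipeline/sentence-splitter.ts` can be approximated like this. A minimal sketch, not the package's actual implementation:

```ts
// Buffer streamed LLM text and emit complete sentences so TTS can start
// speaking before the full response has arrived.
function createSentenceSplitter(onSentence: (sentence: string) => void) {
  let buffer = '';
  return {
    push(chunk: string) {
      buffer += chunk;
      // Split on sentence-ending punctuation followed by whitespace.
      const parts = buffer.split(/(?<=[.!?])\s+/);
      buffer = parts.pop() ?? '';
      for (const sentence of parts) onSentence(sentence);
    },
    flush() {
      if (buffer.trim()) onSentence(buffer.trim());
      buffer = '';
    },
  };
}
```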
Implement the `ISTTProvider` or `ITTSProvider` interface and register it with the factory:
```ts
import {
  createTTSProvider,
  registerSTTProvider,
  registerTTSProvider,
  type ISTTProvider,
  type ITTSProvider,
  type TTSConfig,
} from '@illuma-ai/voice';

// Example: ElevenLabs TTS
class ElevenLabsTTS implements ITTSProvider {
  readonly name = 'elevenlabs';
  private apiKey: string;

  constructor(config: { apiKey: string }) {
    this.apiKey = config.apiKey;
  }

  async synthesize(text: string, config: TTSConfig): Promise<ArrayBuffer> {
    const response = await fetch(
      `https://api.elevenlabs.io/v1/text-to-speech/${config.voiceId}`,
      {
        method: 'POST',
        headers: {
          'xi-api-key': this.apiKey,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ text }),
      },
    );
    return response.arrayBuffer();
  }

  async getVoices() { /* ... */ }

  destroy() {}
}

// Register so it can be used via the factory
registerTTSProvider('elevenlabs', (config) => new ElevenLabsTTS(config));

// Now use it
const tts = createTTSProvider('elevenlabs', { apiKey: '...' });
```

```ts
interface ISTTProvider {
  readonly name: string;
  transcribe(audio: Blob | Buffer, config: STTConfig): Promise<string>;
  startStreaming(config: STTConfig): Promise<STTStreamSession>;
  destroy(): void;
}
```

```ts
interface ITTSProvider {
  readonly name: string;
  synthesize(text: string, config: TTSConfig): Promise<ArrayBuffer>;
  getVoices(): Promise<Voice[]>;
  destroy(): void;
}
```

Both services are purely pay-per-use with zero idle cost, so they are safe to leave enabled permanently.
| Service | Pricing | Free Tier |
|---|---|---|
| AWS Transcribe Streaming | $0.024/min | 60 min/month (12 months) |
| AWS Polly Neural | $16.00/1M chars | 1M chars/month (12 months) |
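As a rough example, a 10-minute conversation costs about 10 × $0.024 = $0.24 in transcription, and a 5,000-character spoken reply costs about (5,000 / 1,000,000) × $16.00 = $0.08 in synthesis.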
@illuma-ai/voice auto-discovers AWS credentials from your project's environment. If your project already uses AWS services (Bedrock, S3, etc.), voice mode will work with zero additional configuration.
The SDK checks these env vars in order (first found wins):
| Priority | Region | Access Key | Secret Key |
|---|---|---|---|
| 1 (voice-specific) | `VOICE_AWS_REGION` | `VOICE_AWS_ACCESS_KEY_ID` | `VOICE_AWS_SECRET_ACCESS_KEY` |
| 2 (standard AWS) | `AWS_REGION` | `AWS_ACCESS_KEY_ID` | `AWS_SECRET_ACCESS_KEY` |
| 3 (Bedrock fallback) | `BEDROCK_AWS_DEFAULT_REGION` | `BEDROCK_AWS_ACCESS_KEY_ID` | `BEDROCK_AWS_SECRET_ACCESS_KEY` |
| Default | `us-east-1` | — | — |
```bash
# If you already have these, voice mode is already enabled:
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=your-key
AWS_SECRET_ACCESS_KEY=your-secret
```

```bash
# TTS defaults
VOICE_TTS_VOICE_ID=Joanna       # AWS Polly voice (default: Joanna)
VOICE_TTS_OUTPUT_FORMAT=mp3     # mp3, pcm, ogg_vorbis (default: mp3)
VOICE_TTS_SAMPLE_RATE=24000     # Hz (default: 24000)

# STT defaults
VOICE_STT_LANGUAGE=en-US        # Language code (default: en-US)
VOICE_STT_SAMPLE_RATE=16000     # Hz (default: 16000)
VOICE_STT_ENCODING=pcm          # pcm, flac, ogg-opus (default: pcm)
VOICE_STT_PARTIAL_RESULTS=true  # Enable interim results (default: true)
```

```ts
import { loadEnvConfig, createSTTProvider, createTTSProvider } from '@illuma-ai/voice';

// Auto-discover from env
const config = loadEnvConfig();

// Or pass credentials directly (overrides env)
const explicitConfig = loadEnvConfig({
  aws: {
    region: 'eu-west-1',
    credentials: { accessKeyId: '...', secretAccessKey: '...' },
  },
  tts: { voiceId: 'Matthew' },
  stt: { languageCode: 'es-ES' },
});

if (config.enabled) {
  const stt = createSTTProvider('aws-transcribe', config.aws);
  const tts = createTTSProvider('aws-polly', config.aws);
}
```

- Ensure `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are in your `.env`
- No AWS services need to be pre-provisioned; Transcribe and Polly are serverless
- `loadEnvConfig().enabled` will return `true` when credentials are found
- Costs are purely pay-per-use ($0 idle), so it is safe to leave voice mode always enabled
| Import Path | Contents |
|---|---|
| `@illuma-ai/voice` | Providers, factory, pipeline (universal) |
| `@illuma-ai/voice/server` | Express middleware (Node.js only) |
| `@illuma-ai/voice/react` | React hooks and components (browser only) |
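Side by side, using only imports that appear in the examples above:

```ts
// Universal core: providers, factory, pipeline
import { createSTTProvider, createTTSProvider, createVoicePipeline } from '@illuma-ai/voice';

// Node.js only: Express middleware
import { createVoiceRouter } from '@illuma-ai/voice/server';

// Browser only: React hooks and components
import { useVoiceMode, VoiceOverlay, MicrophoneButton } from '@illuma-ai/voice/react';
```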
MIT