Description
Please read this first
- Have you read the docs? Yes – Agents SDK docs
- Have you searched for related issues? Yes – no existing report matched this behavior.
Describe the bug
When a realtime session config includes the top-level voice field, whether a user supplies it directly or it is injected automatically when you instantiate a RealtimeAgent with voice, the config converter flags the entire payload as legacy. As a result, GA-specific audio settings (e.g. audio.input.format, audio.output.format, audio.output.voice) are discarded and the session falls back to legacy defaults (audio/pcm at 24 kHz). Simply constructing new RealtimeAgent({ voice: '...' }) causes carefully chosen telephony formats such as audio/pcmu to be reset.
Debug information
- Agents SDK version: @openai/agents-realtime v0.1.3
- Runtime environment: Node.js 22.16.0
Repro steps
- Add the following test file at packages/agents-realtime/test/realtimeVoiceConfigRegression.test.ts:

    import { describe, it, expect } from 'vitest';
    import { toNewSessionConfig } from '../src/clientMessages';
    import { RealtimeAgent } from '../src/realtimeAgent';
    import { RealtimeSession } from '../src/realtimeSession';
    import { OpenAIRealtimeBase } from '../src/openaiRealtimeBase';
    import type { RealtimeClientMessage } from '../src/clientMessages';

    const TELEPHONY_AUDIO_FORMAT = { type: 'audio/pcmu' as const };

    class CapturingTransport extends OpenAIRealtimeBase {
      status: 'connected' | 'disconnected' | 'connecting' | 'disconnecting' = 'disconnected';
      mergedConfig: any = null;
      events: RealtimeClientMessage[] = [];

      async connect(options: { initialSessionConfig?: any }) {
        this.mergedConfig = (this as any)._getMergedSessionConfig(options.initialSessionConfig ?? {});
      }
      sendEvent(event: RealtimeClientMessage) {
        this.events.push(event);
      }
      mute() {}
      close() {}
      interrupt() {}
      get muted() {
        return false;
      }
    }

    describe('Realtime session voice config regression', () => {
      it('drops GA audio formats when top-level voice is present', () => {
        const converted = toNewSessionConfig({
          voice: 'alloy',
          audio: {
            input: { format: TELEPHONY_AUDIO_FORMAT },
            output: { format: TELEPHONY_AUDIO_FORMAT },
          },
        });

        expect(converted.audio?.input?.format).toEqual(TELEPHONY_AUDIO_FORMAT);
        expect(converted.audio?.output?.format).toEqual(TELEPHONY_AUDIO_FORMAT);
      });

      it('resets audio formats when connecting a session for an agent with voice configured', async () => {
        const transport = new CapturingTransport();
        const agent = new RealtimeAgent({
          name: 'voice-agent',
          instructions: 'Respond cheerfully.',
          voice: 'alloy',
        });
        const session = new RealtimeSession(agent, {
          transport,
          model: 'gpt-realtime',
          config: {
            audio: {
              input: { format: TELEPHONY_AUDIO_FORMAT },
              output: {
                format: TELEPHONY_AUDIO_FORMAT,
                voice: 'marin',
              },
            },
          },
        });

        await session.connect({ apiKey: 'dummy-key' });

        expect(transport.mergedConfig?.audio?.input?.format).toEqual(TELEPHONY_AUDIO_FORMAT);
        expect(transport.mergedConfig?.audio?.output?.format).toEqual(TELEPHONY_AUDIO_FORMAT);
      });
    });

- Run the test:

    pnpm test -- --run packages/agents-realtime/test/realtimeVoiceConfigRegression.test.ts

- Observe the failure:

    AssertionError: expected { type: 'audio/pcmu' } to deeply equal { type: 'audio/pcm', rate: 24000 }
Expected behavior
Configs that are otherwise GA-shaped should remain in GA form when voice is present, so that GA fields such as audio.input.format, audio.output.format, and audio.output.voice are preserved. Only configs containing legacy-only fields (e.g. inputAudioFormat) should trigger the legacy conversion path.
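To make the expected behavior concrete, here is a minimal, self-contained sketch of one way the detection could work. It is not the SDK's implementation: `SketchConfig`, `isLegacyShaped`, and `normalizeGaConfig` are hypothetical names, the config shape is heavily simplified, and treating `outputAudioFormat` as a legacy-only key is an assumption (only `inputAudioFormat` is named above). Letting an explicit audio.output.voice win over the top-level voice is also just one possible choice.

```typescript
// Hypothetical, simplified config shape for illustration only;
// these are NOT the SDK's actual types.
type AudioFormat = { type: string; rate?: number };

interface SketchConfig {
  voice?: string;
  inputAudioFormat?: unknown; // legacy-only field named in this report
  outputAudioFormat?: unknown; // assumed legacy-only counterpart
  audio?: {
    input?: { format?: AudioFormat };
    output?: { format?: AudioFormat; voice?: string };
  };
}

const LEGACY_ONLY_KEYS = ['inputAudioFormat', 'outputAudioFormat'] as const;

// A config counts as legacy only when it carries a legacy-only key.
// A top-level `voice` on its own does not qualify.
function isLegacyShaped(config: SketchConfig): boolean {
  return LEGACY_ONLY_KEYS.some((key) => config[key] !== undefined);
}

// For GA-shaped configs, fold a top-level `voice` into audio.output.voice
// without overriding an explicit GA value, and leave audio formats untouched.
// Legacy-shaped configs are returned unchanged; their conversion is not sketched here.
function normalizeGaConfig(config: SketchConfig): SketchConfig {
  if (isLegacyShaped(config)) {
    return config;
  }
  const { voice, ...rest } = config;
  if (voice === undefined) {
    return rest;
  }
  return {
    ...rest,
    audio: {
      ...rest.audio,
      output: {
        ...rest.audio?.output,
        voice: rest.audio?.output?.voice ?? voice,
      },
    },
  };
}
```

With this shape, a config such as `{ voice: 'alloy', audio: { input: { format: { type: 'audio/pcmu' } } } }` stays on the GA path and keeps its audio/pcmu format, while a config containing inputAudioFormat is still handed to the legacy conversion.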