Realtime session config falls back to legacy format when voice is set #495

@agata

Please read this first

  • Have you read the docs? Yes – Agents SDK docs
  • Have you searched for related issues? Yes – no existing report matched this behavior.

Describe the bug

When a realtime session config includes the top-level voice field, the config converter flags the entire payload as legacy. This happens whether the field is supplied directly or injected automatically when a RealtimeAgent is instantiated with voice. As a result, GA-specific audio settings (e.g. audio.input.format, audio.output.format, audio.output.voice) are discarded and the session falls back to legacy defaults (audio/pcm at 24 kHz). Simply doing new RealtimeAgent({ voice: '...' }) causes explicitly configured telephony formats such as audio/pcmu to be reset.

Debug information

  • Agents SDK version: @openai/agents-realtime v0.1.3
  • Runtime environment: Node.js 22.16.0

Repro steps

  1. Add the following test file at packages/agents-realtime/test/realtimeVoiceConfigRegression.test.ts:

    import { describe, it, expect } from 'vitest';
    import { toNewSessionConfig } from '../src/clientMessages';
    import { RealtimeAgent } from '../src/realtimeAgent';
    import { RealtimeSession } from '../src/realtimeSession';
    import { OpenAIRealtimeBase } from '../src/openaiRealtimeBase';
    import type { RealtimeClientMessage } from '../src/clientMessages';
    
    const TELEPHONY_AUDIO_FORMAT = { type: 'audio/pcmu' as const };
    
    // Transport stub: records the merged session config instead of opening a real connection.
    class CapturingTransport extends OpenAIRealtimeBase {
      status: 'connected' | 'disconnected' | 'connecting' | 'disconnecting' = 'disconnected';
      mergedConfig: any = null;
      events: RealtimeClientMessage[] = [];
    
      async connect(options: { initialSessionConfig?: any }) {
        this.mergedConfig = (this as any)._getMergedSessionConfig(options.initialSessionConfig ?? {});
      }
    
      sendEvent(event: RealtimeClientMessage) {
        this.events.push(event);
      }
    
      mute() {}
      close() {}
      interrupt() {}
    
      get muted() {
        return false;
      }
    }
    
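    // Both tests assert the expected behavior (GA formats preserved), so they fail while the bug is present.
    describe('Realtime session voice config regression', () => {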
      it('drops GA audio formats when top-level voice is present', () => {
        const converted = toNewSessionConfig({
          voice: 'alloy',
          audio: {
            input: { format: TELEPHONY_AUDIO_FORMAT },
            output: { format: TELEPHONY_AUDIO_FORMAT },
          },
        });
    
        expect(converted.audio?.input?.format).toEqual(TELEPHONY_AUDIO_FORMAT);
        expect(converted.audio?.output?.format).toEqual(TELEPHONY_AUDIO_FORMAT);
      });
    
      it('resets audio formats when connecting a session for an agent with voice configured', async () => {
        const transport = new CapturingTransport();
        const agent = new RealtimeAgent({
          name: 'voice-agent',
          instructions: 'Respond cheerfully.',
          voice: 'alloy',
        });
    
        const session = new RealtimeSession(agent, {
          transport,
          model: 'gpt-realtime',
          config: {
            audio: {
              input: { format: TELEPHONY_AUDIO_FORMAT },
              output: {
                format: TELEPHONY_AUDIO_FORMAT,
                voice: 'marin',
              },
            },
          },
        });
    
        await session.connect({ apiKey: 'dummy-key' });
    
        expect(transport.mergedConfig?.audio?.input?.format).toEqual(TELEPHONY_AUDIO_FORMAT);
        expect(transport.mergedConfig?.audio?.output?.format).toEqual(TELEPHONY_AUDIO_FORMAT);
      });
    });

  2. Run the test:

    pnpm test -- --run packages/agents-realtime/test/realtimeVoiceConfigRegression.test.ts

  3. Observe the failure:

    AssertionError: expected { type: 'audio/pcmu' } to deeply equal { type: 'audio/pcm', rate: 24000 }
    

Expected behavior

Configs that are otherwise GA-shaped should remain in GA form when voice is present, so that GA fields like audio.output.voice are preserved. Only configs containing legacy-only fields (e.g. inputAudioFormat) should trigger the legacy conversion path.
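
To make the expectation concrete, here is a rough sketch of the detection rule I have in mind. It is not the SDK's actual implementation: `SessionConfigLike`, `LEGACY_ONLY_KEYS`, `isLegacyConfig`, and `normalizeVoice` are hypothetical names, `outputAudioFormat` is assumed to be the legacy counterpart of `inputAudioFormat`, and letting an explicit `audio.output.voice` win over the top-level `voice` is my assumption about the desired precedence.

```typescript
// Hypothetical sketch; none of these names come from the SDK source.
type AudioFormat = { type: string; rate?: number };

interface SessionConfigLike {
  voice?: string; // may appear alongside GA fields
  inputAudioFormat?: unknown; // legacy-only (named in this issue)
  outputAudioFormat?: unknown; // assumed legacy-only counterpart
  audio?: {
    input?: { format?: AudioFormat };
    output?: { format?: AudioFormat; voice?: string };
  };
}

// Keys that exist only in the legacy shape. `voice` is deliberately absent.
const LEGACY_ONLY_KEYS = ['inputAudioFormat', 'outputAudioFormat'] as const;

function isLegacyConfig(config: SessionConfigLike): boolean {
  return LEGACY_ONLY_KEYS.some((key) => config[key] !== undefined);
}

// For GA-shaped configs, fold the top-level voice into audio.output.voice
// without touching the configured formats. Legacy configs are returned
// unchanged here; their conversion is out of scope for this sketch.
function normalizeVoice(config: SessionConfigLike): SessionConfigLike {
  if (isLegacyConfig(config)) {
    return config;
  }
  const { voice, ...rest } = config;
  if (voice === undefined) {
    return rest;
  }
  return {
    ...rest,
    audio: {
      ...rest.audio,
      output: {
        ...rest.audio?.output,
        // Assumption: an explicit GA voice takes precedence.
        voice: rest.audio?.output?.voice ?? voice,
      },
    },
  };
}
```

With a rule along these lines, the config from the second test (agent voice 'alloy', audio.output.voice 'marin', audio/pcmu formats) would keep its formats and its GA voice.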
