
Conversation

@pgrayy (Collaborator) commented Dec 2, 2025

Description

Isolate the model inference configs so they can be extracted more easily from provider_configs.

Testing

  • I ran hatch run bidi:prepare.
  • Updated unit tests.
  • I ran the following test script:
import asyncio
import json

from strands.experimental.bidi import BidiAgent
from strands.experimental.bidi.io import BidiAudioIO, BidiTextIO
from strands.experimental.bidi.models import BidiNovaSonicModel
from strands.experimental.bidi.models.gemini_live import BidiGeminiLiveModel
from strands.experimental.bidi.models.openai_realtime import BidiOpenAIRealtimeModel
from strands.experimental.bidi.tools import stop_conversation

from strands_tools import calculator


async def main() -> None:
    model = BidiNovaSonicModel(
        model_id="amazon.nova-sonic-v1:0",
        provider_config={
            "audio": {
                "voice": "matthew",
            },
            "inference": {
                "max_tokens": 600,
            },
        },
        client_config={"region": "us-east-1"},
    )
    # model = BidiGeminiLiveModel(
    #     model_id="gemini-2.5-flash-native-audio-preview-09-2025",
    #     provider_config={
    #         "audio": {
    #             "voice": "Charon",
    #         },
    #         "inference": {
    #             "max_output_tokens": 600,
    #         },
    #     },
    #     client_config={"api_key": "..."},
    # )
    # model = BidiOpenAIRealtimeModel(
    #     model_id="gpt-realtime",
    #     provider_config={
    #         "audio": {
    #             "voice": "coral",
    #         },
    #         "inference": {
    #             "max_output_tokens": 700,
    #         },
    #     },
    #     client_config={"api_key": "..."},
    # )
    agent = BidiAgent(model=model, tools=[calculator, stop_conversation])

    audio_io = BidiAudioIO()
    text_io = BidiTextIO()
    await agent.run(inputs=[audio_io.input()], outputs=[audio_io.output(), text_io.output()])

    print(f"MAIN - stopping agent: {json.dumps(agent.messages, indent=2)}")


if __name__ == "__main__":
    asyncio.run(main())
  • Model responded as expected.
  • Toggling the max tokens setting did affect output size.
  • Stop conversation tool worked as expected.
  • Interruption worked.

def _resolve_provider_config(self, config: dict[str, Any]) -> dict[str, Any]:
    """Merge user config with defaults (user takes precedence)."""
    # Extract voice from provider-specific speech_config.voice_config.prebuilt_voice_config.voice_name if present
    provider_voice = None

Voice is passed in through the "audio" config.
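A minimal sketch of the kind of defaults merge _resolve_provider_config performs, where voice lives under the "audio" key of the merged config. The helper name, defaults, and values here are illustrative assumptions, not the library's actual implementation:

```python
from typing import Any

# Illustrative defaults; the real defaults live in each model class.
DEFAULT_CONFIG: dict[str, Any] = {
    "audio": {"voice": "matthew", "format": "pcm"},
    "inference": {},
}


def merge_config(defaults: dict[str, Any], user: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge the user config over defaults (user takes precedence)."""
    merged = dict(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


resolved = merge_config(
    DEFAULT_CONFIG,
    {"audio": {"voice": "tiffany"}, "inference": {"max_tokens": 600}},
)
# "voice" is overridden while the default "format" is preserved.
```

With a recursive merge, a user who sets only `audio.voice` still inherits the other audio defaults instead of replacing the whole "audio" dict.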

logger = logging.getLogger(__name__)

# Nova Sonic configuration constants
NOVA_INFERENCE_CONFIG = {"maxTokens": 1024, "topP": 0.9, "temperature": 0.7}

No need to explicitly provide defaults. Nova already has implicit defaults for these that we can rely on.
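One way to rely on those implicit defaults is to include the inference settings in the session event only when the user actually set something. This is a sketch under the assumption that the Nova Sonic session event carries an inferenceConfiguration block; the helper name is hypothetical:

```python
def nova_session_config(user_inference: dict) -> dict:
    """Build a session config fragment, omitting inferenceConfiguration entirely
    when the user set nothing so Nova's implicit server-side defaults apply."""
    config: dict = {}
    if user_inference:
        config["inferenceConfiguration"] = user_inference
    return config


nova_session_config({})  # no explicit defaults sent; Nova picks its own
nova_session_config({"maxTokens": 600})  # only the user-specified override is sent
```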

def _resolve_provider_config(self, config: dict[str, Any]) -> dict[str, Any]:
    """Merge user config with defaults (user takes precedence)."""
    # Extract voice from provider-specific audio.output.voice if present
    provider_voice = None
@pgrayy commented Dec 2, 2025


Voice is provided through the "audio" config of type AudioConfig.

"input_audio_format",
"output_audio_format",
"input_audio_transcription",
"turn_detection",

  • type always has to be realtime and is already set by us.
  • instructions is set by us through system prompt.
  • voice is set by us through "audio" config.
  • tools is set by us through the passed in tools param.
  • input_audio_format, output_audio_format, input_audio_transcription, and turn_detection are not top-level configs and so would lead to exceptions if setting.

For more details on supported settings, see https://platform.openai.com/docs/api-reference/realtime-client-events/session/update#realtime_client_events-session-update-session.
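The rejection of these keys could be sketched as a simple validation pass over the user-supplied session config. The helper name and error message are illustrative, not the library's actual API:

```python
# Keys the agent manages itself (type, instructions, voice, tools) or that are
# not valid top-level session fields (the audio/transcription/turn settings).
UNSUPPORTED_SESSION_KEYS = {
    "type",
    "instructions",
    "voice",
    "tools",
    "input_audio_format",
    "output_audio_format",
    "input_audio_transcription",
    "turn_detection",
}


def validate_session_config(user_config: dict) -> dict:
    """Raise if the user tries to set a key the agent owns or the API rejects."""
    bad = UNSUPPORTED_SESSION_KEYS & user_config.keys()
    if bad:
        raise ValueError(f"unsupported session config keys: {sorted(bad)}")
    return user_config


validate_session_config({"max_output_tokens": 700})  # passes through
```

Failing fast here surfaces a clear error to the user instead of an opaque exception from the OpenAI Realtime session.update call.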

@pgrayy marked this pull request as ready for review December 2, 2025 01:15
@pgrayy force-pushed the model-inference-config branch from d469584 to 8d1461c on December 2, 2025 01:25
@github-actions bot added and removed the size/m label Dec 2, 2025
"max_tokens": "maxTokens",
"temperature": "temperature",
"top_p": "topP",
}

Using this mapping to promote consistency. We use snake_case everywhere else.
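Applying the mapping shown in the diff is a one-line translation from the SDK's snake_case inference keys to Nova's camelCase names; the helper name here is illustrative:

```python
# Mapping from the diff: SDK snake_case keys to Nova Sonic camelCase keys.
NOVA_INFERENCE_KEY_MAP = {
    "max_tokens": "maxTokens",
    "temperature": "temperature",
    "top_p": "topP",
}


def to_nova_keys(inference: dict) -> dict:
    """Translate snake_case inference keys to Nova's camelCase field names."""
    return {NOVA_INFERENCE_KEY_MAP[k]: v for k, v in inference.items()}


to_nova_keys({"max_tokens": 600, "top_p": 0.9})
```

This keeps the user-facing config snake_case across all providers while each model adapter translates to its wire format.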

"input_rate": GEMINI_INPUT_SAMPLE_RATE,
"output_rate": GEMINI_OUTPUT_SAMPLE_RATE,
"channels": GEMINI_CHANNELS,
"format": "pcm",
Owner

I think we should have a default voice here.

@pgrayy replied Dec 2, 2025


It does work without specifying. I tested that on all the models actually. With that said, we could remove the default voice setting on all configs but I didn't want to make too many changes here.

@pgrayy merged commit a46828d into main Dec 2, 2025
11 of 13 checks passed
@pgrayy deleted the model-inference-config branch December 2, 2025 01:36