diff --git a/examples/Realtime_out_of_band_transcription.ipynb b/examples/Realtime_out_of_band_transcription.ipynb new file mode 100644 index 0000000000..63f4e4bf9f --- /dev/null +++ b/examples/Realtime_out_of_band_transcription.ipynb @@ -0,0 +1,1022 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Transcribing User Audio with a Separate Realtime Request\n", + "\n", + "**Purpose**: This notebook demonstrates how to use the Realtime model itself to accurately transcribe user audio `out-of-band` using the same websocket session connection, avoiding errors and inconsistencies common when relying on a separate transcription model (gpt-4o-transcribe/whisper-1).\n", + "\n", + "> We call this out-of-band transcription using the realtime model. It refers to running a separate realtime model request to transcribe the user’s audio outside the live Realtime conversation.\n", + "\n", + "It covers how to build a server-to-server client that:\n", + "\n", + "- Streams microphone audio to an OpenAI Realtime voice agent.\n", + "- Plays back the agent's spoken replies.\n", + "- After each user turn, generates a high-quality text-only transcript using the **same Realtime model**.\n", + "\n", + "This is achieved via a secondary `response.create` request:\n", + "\n", + "```python\n", + "{\n", + " \"type\": \"response.create\",\n", + " \"response\": {\n", + " \"conversation\": \"none\",\n", + " \"output_modalities\": [\"text\"],\n", + " \"instructions\": transcription_instructions\n", + " }\n", + "}\n", + "```\n", + "\n", + "This notebook demonstrates using the **Realtime model itself** for transcription:\n", + "\n", + "- **Context-aware transcription**: Uses the full session context to improve transcript accuracy.\n", + "- **Non-intrusive**: Runs outside the live conversation, so the transcript is never added back to session state.\n", + "- **Customizable instructions**: Allows tailoring transcription prompts to specific use-cases. Realtime model is better than the transcription model at following instructions." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Why use out-of-band transcription?\n", + "\n", + "The Realtime API offers built-in user input transcription, but this relies on a **separate ASR model** (e.g., gpt-4o-transcribe). Using different models for transcription and response generation can lead to discrepancies. 
For example:\n",
+ "\n",
+ "- User speech transcribed as: `I had otoo accident`\n",
+ "- Realtime response interpreted correctly as: `Got it, you had an auto accident`\n",
+ "\n",
+ "Accurate transcriptions can be very important, particularly when:\n",
+ "\n",
+ "- Transcripts trigger downstream actions (e.g., tool calls), where errors propagate through the system.\n",
+ "- Transcripts are summarized or passed to other components, risking context pollution.\n",
+ "- Transcripts are displayed to end users, leading to poor user experiences if errors occur.\n",
+ "\n",
+ "The potential advantages of using out-of-band transcription include:\n",
+ "- **Reduced Mismatch**: The same model is used for both transcription and generation, minimizing inconsistencies between what the user says and how the agent responds.\n",
+ "- **Greater Steerability**: The Realtime model is more steerable, can better follow custom instructions for higher transcription quality, and is not limited by a 1024-token input maximum.\n",
+ "- **Session Context Awareness**: The model has access to the full session context, so, for example, if you mention your name multiple times, it will transcribe it correctly.\n",
+ "\n",
+ "\n",
+ "In terms of **trade-offs**:\n",
+ "\n",
+ "- Realtime Model (for transcription):\n",
+ "  - Audio Input → Text Output: $32.00 per 1M audio tokens + $16.00 per 1M text tokens out.\n",
+ "  - Cached Session Context: $0.40 per 1M cached context tokens (typically negligible).\n",
+ "\n",
+ "  - Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $48.00\n",
+ "\n",
+ "- GPT-4o Transcription:\n",
+ "\n",
+ "  - Audio Input: $6.00 per 1M audio tokens\n",
+ "\n",
+ "  - Text Input: $2.50 per 1M tokens (capped at 1024 tokens, negligible input prompt)\n",
+ "\n",
+ "  - Text Output: $10.00 per 1M tokens\n",
+ "\n",
+ "  - Total Cost (for 1M audio tokens in + 1M text tokens out): ≈ $16.00\n",
+ "\n",
+ "- Direct Cost Comparison:\n",
+ "\n",
+ "  - Realtime Transcription: ~$48.00\n",
+ "\n",
+ "  - GPT-4o Transcription: ~$16.00\n",
+ "\n",
+ "  - Absolute Difference: $48.00 − $16.00 = $32.00\n",
+ "\n",
+ "  - Cost Ratio: $48.00 / $16.00 = 3×\n",
+ "\n",
+ "  Note: Costs related to cached session context ($0.40 per 1M tokens) and the capped text input tokens for GPT-4o ($2.50 per 1M tokens) are negligible and thus excluded from detailed calculations.\n",
+ "\n",
+ "- Other Considerations:\n",
+ "\n",
+ "  - Implementing transcription via the realtime model might be slightly more complex than using the built-in GPT-4o transcription option through the Realtime API.\n",
+ "\n",
+ "> Note: Out-of-band responses using the realtime model can be used for other use cases beyond user turn transcription. Examples include generating structured summaries, triggering background actions, or performing validation tasks without affecting the main conversation.\n",
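+ "\n",
+ "For instance, a similar out-of-band request (sketched below; the instruction text and metadata label are just illustrative) could ask the model for a one-sentence summary of the call so far instead of a transcript:\n",
+ "\n",
+ "```python\n",
+ "{\n",
+ "    \"type\": \"response.create\",\n",
+ "    \"response\": {\n",
+ "        \"conversation\": \"none\",\n",
+ "        \"output_modalities\": [\"text\"],\n",
+ "        \"metadata\": {\"purpose\": \"Call summary\"},\n",
+ "        \"instructions\": \"Summarize the conversation so far in one sentence. Do not address the caller.\"\n",
+ "    }\n",
+ "}\n",
+ "```\n",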
+ "\n",
+ "\"drawing\"\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 2. Requirements & Setup\n",
+ "\n",
+ "Ensure your environment meets these requirements:\n",
+ "\n",
+ "1. **Python 3.10 or later**\n",
+ "\n",
+ "2. **PortAudio** (required by `sounddevice`):\n",
+ "   - macOS:\n",
+ "     ```bash\n",
+ "     brew install portaudio\n",
+ "     ```\n",
+ "\n",
+ "3. **Python Dependencies**:\n",
+ "   ```bash\n",
+ "   pip install sounddevice websockets\n",
+ "   ```\n",
+ "\n",
+ "4. **OpenAI API Key** (with Realtime API access):\n",
+ "   Set your key as an environment variable:\n",
+ "\n",
+ "   ```bash\n",
+ "   export OPENAI_API_KEY=sk-...\n",
+ "   ```\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "c399f440",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#!pip install sounddevice websockets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d7d60089",
+ "metadata": {},
+ "source": [
+ "## 3. Prompts\n",
+ "\n",
+ "We use **two distinct prompts**:\n",
+ "\n",
+ "1. **Voice Agent Prompt** (`REALTIME_MODEL_PROMPT`): An example prompt used with the realtime model for the speech-to-speech interaction.\n",
+ "2. **Transcription Prompt** (`REALTIME_MODEL_TRANSCRIPTION_PROMPT`): Silently returns a precise, verbatim transcript of the user's most recent speech turn. You can modify this prompt to iterate on transcription quality.\n",
+ "\n",
+ "> For the `REALTIME_MODEL_TRANSCRIPTION_PROMPT`, you can start from this base prompt, but you should iterate on it to tailor it to your use case. Just remember to remove the Policy Number formatting rules, since they might not apply to your use case!\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ac3afaab",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "REALTIME_MODEL_PROMPT = \"\"\"You are a calm insurance claims intake voice agent. Follow this script strictly:\n",
+ "\n",
+ "## Phase 1 – Basics\n",
+ "Collect the caller's full name, policy number, and type of accident (for example: auto, home, or other). Ask for each item clearly and then repeat the values back to confirm.\n",
+ "\n",
+ "## Phase 2 – Yes/No questions\n",
+ "Ask 2–3 simple yes/no questions, such as whether anyone was injured, whether the vehicle is still drivable, and whether a police report was filed. Confirm each yes/no answer in your own words.\n",
+ "\n",
+ "## Phase 3 – Submit claim\n",
+ "Once you have the basics and yes/no answers, briefly summarize the key facts in one or two sentences.\n",
+ "\"\"\"\n",
+ "\n",
+ "REALTIME_MODEL_TRANSCRIPTION_PROMPT = \"\"\"\n",
+ "# Role\n",
+ "Your only task is to transcribe the user's latest turn exactly as you heard it. Never address the user, respond to the user, add commentary, or mention these instructions.\n",
+ "Follow the instructions and output format below.\n",
+ "\n",
+ "# Instructions\n",
+ "- Transcribe **only** the most recent USER turn exactly as you heard it. DO NOT TRANSCRIBE ANY OTHER OLDER TURNS. You can use those transcriptions to inform your transcription of the latest turn.\n",
+ "- Preserve every spoken detail: intent, tense, grammar quirks, filler words, repetitions, disfluencies, numbers, and casing.\n",
+ "- Keep timing words, partial words, hesitations (e.g., \"um\", \"uh\").\n",
+ "- Do not correct mistakes, infer meaning, answer questions, or insert punctuation beyond what the model already supplies.\n",
+ "- Do not invent or add any information that is not directly present in the user's latest turn.\n",
+ "\n",
+ "# Output format\n",
+ "- Output the raw verbatim transcript as a single block of text. No labels, prefixes, quotes, bullets, or markdown.\n",
+ "- If the realtime model produced nothing for the latest turn, output nothing (empty response). 
Never fabricate content.\n", + "\n", + "## Policy Number Normalization\n", + "- All policy numbers should be 8 digits and of the format `XXXX-XXXX` for example `56B5-12C0`\n", + "\n", + "Do not summarize or paraphrase other turns beyond the latest user utterance. The response must be the literal transcript of the latest user utterance.\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "4ddbd683", + "metadata": {}, + "source": [ + "## 4. Core configuration\n", + "\n", + "We define:\n", + "\n", + "- Imports\n", + "- Audio and model defaults\n", + "- Constants for transcription event handling" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "4b952a29", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/var/folders/vd/l97lv64j3678b905tff4bc0h0000gp/T/ipykernel_91319/2514869342.py:10: DeprecationWarning: websockets.client.WebSocketClientProtocol is deprecated\n", + " from websockets.client import WebSocketClientProtocol\n" + ] + } + ], + "source": [ + "import asyncio\n", + "import base64\n", + "import json\n", + "import os\n", + "from collections import defaultdict, deque\n", + "from typing import Any\n", + "\n", + "import sounddevice as sd\n", + "import websockets\n", + "from websockets.client import WebSocketClientProtocol\n", + "\n", + "# Basic defaults\n", + "DEFAULT_MODEL = \"gpt-realtime\"\n", + "DEFAULT_VOICE = \"marin\"\n", + "DEFAULT_SAMPLE_RATE = 24_000\n", + "DEFAULT_BLOCK_MS = 100\n", + "DEFAULT_SILENCE_DURATION_MS = 800\n", + "DEFAULT_PREFIX_PADDING_MS = 300\n", + "TRANSCRIPTION_PURPOSE = \"User turn transcription\"" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "7254080a", + "metadata": {}, + "outputs": [], + "source": [ + "# Event grouping constants\n", + "TRANSCRIPTION_DELTA_TYPES = {\n", + " \"input_audio_buffer.transcription.delta\",\n", + " \"input_audio_transcription.delta\",\n", + " \"conversation.item.input_audio_transcription.delta\",\n", + "}\n", + "TRANSCRIPTION_COMPLETE_TYPES = {\n", + " \"input_audio_buffer.transcription.completed\",\n", + " \"input_audio_buffer.transcription.done\",\n", + " \"input_audio_transcription.completed\",\n", + " \"input_audio_transcription.done\",\n", + " \"conversation.item.input_audio_transcription.completed\",\n", + " \"conversation.item.input_audio_transcription.done\",\n", + "}\n", + "INPUT_SPEECH_END_EVENT_TYPES = {\n", + " \"input_audio_buffer.speech_stopped\",\n", + " \"input_audio_buffer.committed\",\n", + "}\n", + "RESPONSE_AUDIO_DELTA_TYPES = {\n", + " \"response.output_audio.delta\",\n", + " \"response.audio.delta\",\n", + "}\n", + "RESPONSE_TEXT_DELTA_TYPES = {\n", + " \"response.output_text.delta\",\n", + " \"response.text.delta\",\n", + "}\n", + "RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES = {\n", + " \"response.output_audio_transcript.delta\",\n", + " \"response.audio_transcript.delta\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "a905ec16", + "metadata": {}, + "source": [ + "## 5. 
Building the Realtime session & the out‑of‑band request\n",
+ "\n",
+ "The Realtime session (`session.update`) configures:\n",
+ "\n",
+ "- Audio input/output\n",
+ "- Server‑side VAD\n",
+ "- Built‑in transcription (`input_audio_transcription_model`)\n",
+ "  - We set this so that we can compare it to the realtime model transcription\n",
+ "\n",
+ "The out‑of‑band transcription is a `response.create` triggered after the user's input audio is committed (`input_audio_buffer.committed`):\n",
+ "\n",
+ "- `conversation: \"none\"` – use session state but don’t write to the main conversation session state\n",
+ "- `output_modalities: [\"text\"]` – get a text transcript only\n",
+ "\n",
+ "> Note: The REALTIME_MODEL_TRANSCRIPTION_PROMPT is not passed to the gpt-4o-transcribe model because the Realtime API enforces a 1024-token maximum for prompts.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "4baf1870",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def build_session_update(\n",
+ "    instructions: str,\n",
+ "    voice: str,\n",
+ "    vad_threshold: float,\n",
+ "    silence_duration_ms: int,\n",
+ "    prefix_padding_ms: int,\n",
+ "    idle_timeout_ms: int | None,\n",
+ "    input_audio_transcription_model: str | None = None,\n",
+ ") -> dict[str, object]:\n",
+ "    \"\"\"Configure the Realtime session: audio in/out, server VAD, etc.\"\"\"\n",
+ "\n",
+ "    turn_detection: dict[str, float | int | bool | str] = {\n",
+ "        \"type\": \"server_vad\",\n",
+ "        \"threshold\": vad_threshold,\n",
+ "        \"silence_duration_ms\": silence_duration_ms,\n",
+ "        \"prefix_padding_ms\": prefix_padding_ms,\n",
+ "        \"create_response\": True,\n",
+ "        \"interrupt_response\": True,\n",
+ "    }\n",
+ "\n",
+ "    if idle_timeout_ms is not None:\n",
+ "        turn_detection[\"idle_timeout_ms\"] = idle_timeout_ms\n",
+ "\n",
+ "    audio_config: dict[str, Any] = {\n",
+ "        \"input\": {\n",
+ "            \"format\": {\n",
+ "                \"type\": \"audio/pcm\",\n",
+ "                \"rate\": DEFAULT_SAMPLE_RATE,\n",
+ "            },\n",
+ "            \"noise_reduction\": {\"type\": \"near_field\"},\n",
+ "            \"turn_detection\": turn_detection,\n",
+ "        },\n",
+ "        \"output\": {\n",
+ "            \"format\": {\n",
+ "                \"type\": \"audio/pcm\",\n",
+ "                \"rate\": DEFAULT_SAMPLE_RATE,\n",
+ "            },\n",
+ "            \"voice\": voice,\n",
+ "        },\n",
+ "    }\n",
+ "\n",
+ "    # Optional: built-in transcription model for comparison\n",
+ "    if input_audio_transcription_model:\n",
+ "        audio_config[\"input\"][\"transcription\"] = {\n",
+ "            \"model\": input_audio_transcription_model,\n",
+ "        }\n",
+ "\n",
+ "    session: dict[str, object] = {\n",
+ "        \"type\": \"realtime\",\n",
+ "        \"output_modalities\": [\"audio\"],\n",
+ "        \"instructions\": instructions,\n",
+ "        \"audio\": audio_config,\n",
+ "    }\n",
+ "\n",
+ "    return {\n",
+ "        \"type\": \"session.update\",\n",
+ "        \"session\": session,\n",
+ "    }\n",
+ "\n",
+ "\n",
+ "def build_transcription_request(transcription_instructions: str) -> dict[str, object]:\n",
+ "    \"\"\"Ask the SAME Realtime model for an out-of-band transcript of the latest user turn.\"\"\"\n",
+ "\n",
+ "    return {\n",
+ "        \"type\": \"response.create\",\n",
+ "        \"response\": {\n",
+ "            \"conversation\": \"none\",  # <--- out-of-band\n",
+ "            \"output_modalities\": [\"text\"],\n",
+ "            \"metadata\": {\"purpose\": TRANSCRIPTION_PURPOSE},  # <--- we add metadata so it is easier to identify the event in the logs\n",
+ "            \"instructions\": transcription_instructions,\n",
+ "        },\n",
+ "    }\n"
+ ]
+ },
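+ {
+ "cell_type": "markdown",
+ "id": "payload-sanity-check-md",
+ "metadata": {},
+ "source": [
+ "As a quick, optional check, the cell below builds both payloads with the defaults defined earlier in this notebook and prints a truncated JSON dump, without opening a WebSocket.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "payload-sanity-check",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional sanity check: inspect the payloads built above without connecting.\n",
+ "example_session_update = build_session_update(\n",
+ "    instructions=REALTIME_MODEL_PROMPT,\n",
+ "    voice=DEFAULT_VOICE,\n",
+ "    vad_threshold=0.6,\n",
+ "    silence_duration_ms=DEFAULT_SILENCE_DURATION_MS,\n",
+ "    prefix_padding_ms=DEFAULT_PREFIX_PADDING_MS,\n",
+ "    idle_timeout_ms=None,\n",
+ ")\n",
+ "print(json.dumps(example_session_update, indent=2)[:500])\n",
+ "\n",
+ "example_transcription_request = build_transcription_request(REALTIME_MODEL_TRANSCRIPTION_PROMPT)\n",
+ "print(json.dumps(example_transcription_request, indent=2)[:500])\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9afe7911",
+ "metadata": {},
+ "source": [
+ "## 6. 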
Audio streaming: mic → Realtime → speakers\n", + "\n", + "We now define:\n", + "\n", + "- `encode_audio` – base64 helper\n", + "- `playback_audio` – play assistant audio on the default output device\n", + "- `send_audio_from_queue` – send buffered mic audio to `input_audio_buffer`\n", + "- `stream_microphone_audio` – capture PCM16 from the mic and feed the queue\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "11218bbb", + "metadata": {}, + "outputs": [], + "source": [ + "def encode_audio(chunk: bytes) -> str:\n", + " \"\"\"Base64-encode a PCM audio chunk for WebSocket transport.\"\"\"\n", + " return base64.b64encode(chunk).decode(\"utf-8\")\n", + "\n", + "\n", + "async def playback_audio(\n", + " playback_queue: asyncio.Queue,\n", + " stop_event: asyncio.Event,\n", + ") -> None:\n", + " \"\"\"Stream assistant audio back to the speakers in (near) real time.\"\"\"\n", + "\n", + " try:\n", + " with sd.RawOutputStream(\n", + " samplerate=DEFAULT_SAMPLE_RATE,\n", + " channels=1,\n", + " dtype=\"int16\",\n", + " ) as stream:\n", + " while not stop_event.is_set():\n", + " chunk = await playback_queue.get()\n", + " if chunk is None:\n", + " break\n", + " try:\n", + " stream.write(chunk)\n", + " except Exception as exc:\n", + " print(f\"Audio playback error: {exc}\", flush=True)\n", + " break\n", + " except Exception as exc:\n", + " print(f\"Failed to open audio output stream: {exc}\", flush=True)\n", + "\n", + "\n", + "async def send_audio_from_queue(\n", + " ws: WebSocketClientProtocol,\n", + " queue: asyncio.Queue[bytes | None],\n", + " stop_event: asyncio.Event,\n", + ") -> None:\n", + " \"\"\"Push raw PCM chunks into input_audio_buffer via the WebSocket.\"\"\"\n", + "\n", + " while not stop_event.is_set():\n", + " chunk = await queue.get()\n", + " if chunk is None:\n", + " break\n", + " encoded_chunk = encode_audio(chunk)\n", + " message = {\"type\": \"input_audio_buffer.append\", \"audio\": encoded_chunk}\n", + " await ws.send(json.dumps(message))\n", + "\n", + " if not ws.closed:\n", + " commit_payload = {\"type\": \"input_audio_buffer.commit\"}\n", + " await ws.send(json.dumps(commit_payload))\n", + "\n", + "\n", + "async def stream_microphone_audio(\n", + " ws: WebSocketClientProtocol,\n", + " stop_event: asyncio.Event,\n", + " shared_state: dict,\n", + " block_ms: int = DEFAULT_BLOCK_MS,\n", + ") -> None:\n", + " \"\"\"Capture live microphone audio and send it to the realtime session.\"\"\"\n", + "\n", + " loop = asyncio.get_running_loop()\n", + " audio_queue: asyncio.Queue[bytes | None] = asyncio.Queue()\n", + " blocksize = int(DEFAULT_SAMPLE_RATE * (block_ms / 1000))\n", + "\n", + " def on_audio(indata, frames, time_info, status): # type: ignore[override]\n", + " \"\"\"Capture a mic callback chunk and enqueue it unless the mic is muted.\"\"\"\n", + " if status:\n", + " print(f\"Microphone status: {status}\", flush=True)\n", + " # Simple echo protection: mute mic when assistant is talking\n", + " if not stop_event.is_set() and not shared_state.get(\"mute_mic\", False):\n", + " data = bytes(indata)\n", + " loop.call_soon_threadsafe(audio_queue.put_nowait, data)\n", + "\n", + " print(\n", + " f\"Streaming microphone audio at {DEFAULT_SAMPLE_RATE} Hz (mono). 
\"\n",
+ "        \"Speak naturally; server VAD will stop listening when you pause.\"\n",
+ "    )\n",
+ "    sender = asyncio.create_task(send_audio_from_queue(ws, audio_queue, stop_event))\n",
+ "\n",
+ "    with sd.RawInputStream(\n",
+ "        samplerate=DEFAULT_SAMPLE_RATE,\n",
+ "        blocksize=blocksize,\n",
+ "        channels=1,\n",
+ "        dtype=\"int16\",\n",
+ "        callback=on_audio,\n",
+ "    ):\n",
+ "        await stop_event.wait()\n",
+ "\n",
+ "    await audio_queue.put(None)\n",
+ "    await sender"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d02cc1bd",
+ "metadata": {},
+ "source": [
+ "## 7. Extracting and comparing transcripts\n",
+ "\n",
+ "We end up with **two transcripts** for each user turn:\n",
+ "\n",
+ "- **Realtime model transcript**: from our out-of-band `response.create` call.\n",
+ "- **Built-in ASR transcript**: from the standard transcription model (`input_audio_transcription_model`).\n",
+ "\n",
+ "The helper below aligns the two so they are displayed together in the terminal:\n",
+ "\n",
+ "```text\n",
+ "=== User turn (Realtime transcript) ===\n",
+ "...\n",
+ "\n",
+ "=== User turn (Transcription model) ===\n",
+ "...\n",
+ "```\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "cb6acbf0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def flush_pending_transcription_prints(shared_state: dict) -> None:\n",
+ "    \"\"\"Whenever we've printed a realtime transcript, print the matching transcription-model output.\"\"\"\n",
+ "\n",
+ "    pending_prints: deque | None = shared_state.get(\"pending_transcription_prints\")\n",
+ "    input_transcripts: deque | None = shared_state.get(\"input_transcripts\")\n",
+ "\n",
+ "    if not pending_prints or not input_transcripts:\n",
+ "        return\n",
+ "\n",
+ "    while pending_prints and input_transcripts:\n",
+ "        comparison_text = input_transcripts.popleft()\n",
+ "        pending_prints.popleft()\n",
+ "        print(\"=== User turn (Transcription model) ===\")\n",
+ "        if comparison_text:\n",
+ "            print(comparison_text, flush=True)\n",
+ "            print()\n",
+ "        else:\n",
+ "            print(\"\", flush=True)\n",
+ "            print()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6025bbf6",
+ "metadata": {},
+ "source": [
+ "## 8. 
Listening for Realtime events\n", + "\n", + "`listen_for_events` drives the session:\n", + "\n", + "- Watches for `speech_started` / `speech_stopped` / `committed`\n", + "- Sends the out‑of‑band transcription request when a user turn finishes (`input_audio_buffer.committed`)\n", + "- Streams assistant audio to the playback queue\n", + "- Buffers text deltas per `response_id`" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "d099babd", + "metadata": {}, + "outputs": [], + "source": [ + "async def listen_for_events(\n", + " ws: WebSocketClientProtocol,\n", + " stop_event: asyncio.Event,\n", + " transcription_instructions: str,\n", + " max_turns: int | None,\n", + " playback_queue: asyncio.Queue,\n", + " shared_state: dict,\n", + ") -> None:\n", + " \"\"\"Print assistant text + transcripts and coordinate mic muting.\"\"\"\n", + "\n", + " responses: dict[str, dict[str, bool]] = {}\n", + " buffers: defaultdict[str, str] = defaultdict(str)\n", + " transcription_model_buffers: defaultdict[str, str] = defaultdict(str)\n", + " completed_main_responses = 0\n", + " awaiting_transcription_prompt = False\n", + " input_transcripts = shared_state.setdefault(\"input_transcripts\", deque())\n", + " pending_transcription_prints = shared_state.setdefault(\n", + " \"pending_transcription_prints\", deque()\n", + " )\n", + "\n", + " async for raw in ws:\n", + " if stop_event.is_set():\n", + " break\n", + "\n", + " message = json.loads(raw)\n", + " message_type = message.get(\"type\")\n", + "\n", + " # --- User speech events -------------------------------------------------\n", + " if message_type == \"input_audio_buffer.speech_started\":\n", + " print(\"\\n[client] Speech detected; streaming...\", flush=True)\n", + " awaiting_transcription_prompt = True\n", + "\n", + " elif message_type in INPUT_SPEECH_END_EVENT_TYPES:\n", + " if message_type == \"input_audio_buffer.speech_stopped\":\n", + " print(\"[client] Detected silence; preparing transcript...\", flush=True)\n", + "\n", + " # This is where the out-of-band transcription request is sent. 
<-------\n", + " if awaiting_transcription_prompt:\n", + " request_payload = build_transcription_request(\n", + " transcription_instructions\n", + " )\n", + " await ws.send(json.dumps(request_payload))\n", + " awaiting_transcription_prompt = False\n", + "\n", + " # --- Built-in transcription model stream -------------------------------\n", + " elif message_type in TRANSCRIPTION_DELTA_TYPES:\n", + " buffer_id = message.get(\"buffer_id\") or message.get(\"item_id\") or \"default\"\n", + " delta_text = (\n", + " message.get(\"delta\")\n", + " or (message.get(\"transcription\") or {}).get(\"text\")\n", + " or \"\"\n", + " )\n", + " if delta_text:\n", + " transcription_model_buffers[buffer_id] += delta_text\n", + "\n", + " elif message_type in TRANSCRIPTION_COMPLETE_TYPES:\n", + " buffer_id = message.get(\"buffer_id\") or message.get(\"item_id\") or \"default\"\n", + " final_text = (\n", + " (message.get(\"transcription\") or {}).get(\"text\")\n", + " or message.get(\"transcript\")\n", + " or \"\"\n", + " )\n", + " if not final_text:\n", + " final_text = transcription_model_buffers.pop(buffer_id, \"\").strip()\n", + " else:\n", + " transcription_model_buffers.pop(buffer_id, None)\n", + "\n", + " if not final_text:\n", + " item = message.get(\"item\")\n", + " if item:\n", + " final_text = item.get(\"transcription\")\n", + " final_text = final_text or \"\"\n", + "\n", + " final_text = final_text.strip()\n", + " if final_text:\n", + " input_transcripts.append(final_text)\n", + " flush_pending_transcription_prints(shared_state)\n", + "\n", + " # --- Response lifecycle (Realtime model) --------------------------------\n", + " elif message_type == \"response.created\":\n", + " response = message.get(\"response\", {})\n", + " response_id = response.get(\"id\")\n", + " metadata = response.get(\"metadata\") or {}\n", + " responses[response_id] = {\n", + " \"is_transcription\": metadata.get(\"purpose\") == TRANSCRIPTION_PURPOSE,\n", + " \"done\": False,\n", + " }\n", + "\n", + " elif message_type in RESPONSE_AUDIO_DELTA_TYPES:\n", + " response_id = message.get(\"response_id\")\n", + " if response_id is None:\n", + " continue\n", + " b64_audio = message.get(\"delta\") or message.get(\"audio\")\n", + " if not b64_audio:\n", + " continue\n", + " try:\n", + " audio_chunk = base64.b64decode(b64_audio)\n", + " except Exception:\n", + " continue\n", + "\n", + " if (\n", + " response_id in responses\n", + " and not responses[response_id][\"is_transcription\"]\n", + " ):\n", + " shared_state[\"mute_mic\"] = True\n", + "\n", + " await playback_queue.put(audio_chunk)\n", + "\n", + " elif message_type in RESPONSE_TEXT_DELTA_TYPES:\n", + " response_id = message.get(\"response_id\")\n", + " if response_id is None:\n", + " continue\n", + " buffers[response_id] += message.get(\"delta\", \"\")\n", + " \n", + "\n", + " elif message_type in RESPONSE_AUDIO_TRANSCRIPT_DELTA_TYPES:\n", + " response_id = message.get(\"response_id\")\n", + " if response_id is None:\n", + " continue\n", + " buffers[response_id] += message.get(\"delta\", \"\") \n", + "\n", + " elif message_type == \"response.done\":\n", + " response = message.get(\"response\", {})\n", + " response_id = response.get(\"id\")\n", + " if response_id is None:\n", + " continue\n", + " if response_id not in responses:\n", + " responses[response_id] = {\"is_transcription\": False, \"done\": False}\n", + " responses[response_id][\"done\"] = True\n", + "\n", + " is_transcription = responses[response_id][\"is_transcription\"]\n", + " text = buffers.get(response_id, 
\"\").strip()\n",
+ "            if text:\n",
+ "                if is_transcription:\n",
+ "                    print(\"\\n=== User turn (Realtime transcript) ===\")\n",
+ "                    print(text, flush=True)\n",
+ "                    print()\n",
+ "                    pending_transcription_prints.append(object())\n",
+ "                    flush_pending_transcription_prints(shared_state)\n",
+ "                else:\n",
+ "                    print(\"\\n=== Assistant response ===\")\n",
+ "                    print(text, flush=True)\n",
+ "                    print()\n",
+ "\n",
+ "            if not is_transcription:\n",
+ "                shared_state[\"mute_mic\"] = False\n",
+ "                completed_main_responses += 1\n",
+ "\n",
+ "                if max_turns is not None and completed_main_responses >= max_turns:\n",
+ "                    stop_event.set()\n",
+ "                    break\n",
+ "\n",
+ "        elif message_type == \"error\":\n",
+ "            print(f\"Error from server: {message}\")\n",
+ "\n",
+ "        else:\n",
+ "            pass\n",
+ "\n",
+ "        await asyncio.sleep(0)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "10c69ded",
+ "metadata": {},
+ "source": [
+ "## 9. Run Script\n",
+ "\n",
+ "In this step, we run the code, which lets us compare the realtime model transcription with the transcription model output for each turn. The code does the following:\n",
+ "\n",
+ "- Loads configuration and prompts\n",
+ "- Establishes a WebSocket connection\n",
+ "- Starts concurrent tasks:\n",
+ "  - `listen_for_events` (handle incoming messages)\n",
+ "  - `stream_microphone_audio` (send microphone audio)\n",
+ "    - Mutes the mic while the assistant is speaking\n",
+ "  - `playback_audio` (play assistant responses)\n",
+ "  - Prints the realtime and transcription model transcripts once both have been returned, using `shared_state` to make sure both are available before printing\n",
+ "- Runs the session until you interrupt it\n",
+ "\n",
+ "Output should look like:\n",
+ "```text\n",
+ "[client] Speech detected; streaming...\n",
+ "[client] Detected silence; preparing transcript...\n",
+ "\n",
+ "=== User turn (Realtime transcript) ===\n",
+ "Hello.\n",
+ "\n",
+ "=== User turn (Transcription model) ===\n",
+ "Hello\n",
+ "\n",
+ "\n",
+ "=== Assistant response ===\n",
+ "Hello, and thank you for calling. 
Let's start with your full name, please.\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "35c4d7b5", + "metadata": {}, + "outputs": [], + "source": [ + "async def run_realtime_session(\n", + " api_key: str | None = None,\n", + " server: str = \"wss://api.openai.com/v1/realtime\",\n", + " model: str = DEFAULT_MODEL,\n", + " voice: str = DEFAULT_VOICE,\n", + " instructions: str = REALTIME_MODEL_PROMPT,\n", + " transcription_instructions: str = REALTIME_MODEL_TRANSCRIPTION_PROMPT,\n", + " input_audio_transcription_model: str | None = \"gpt-4o-transcribe\",\n", + " silence_duration_ms: int = DEFAULT_SILENCE_DURATION_MS,\n", + " prefix_padding_ms: int = DEFAULT_PREFIX_PADDING_MS,\n", + " vad_threshold: float = 0.6,\n", + " idle_timeout_ms: int | None = None,\n", + " max_turns: int | None = None,\n", + " timeout_seconds: int = 0,\n", + ") -> None:\n", + " \"\"\"Connect to the Realtime API, stream audio both ways, and print transcripts.\"\"\"\n", + " api_key = api_key or os.environ.get(\"OPENAI_API_KEY\")\n", + " ws_url = f\"{server}?model={model}\"\n", + " headers = {\n", + " \"Authorization\": f\"Bearer {api_key}\",\n", + " }\n", + "\n", + " session_update_payload = build_session_update(\n", + " instructions=instructions,\n", + " voice=voice,\n", + " vad_threshold=vad_threshold,\n", + " silence_duration_ms=silence_duration_ms,\n", + " prefix_padding_ms=prefix_padding_ms,\n", + " idle_timeout_ms=idle_timeout_ms,\n", + " input_audio_transcription_model=input_audio_transcription_model,\n", + " )\n", + " stop_event = asyncio.Event()\n", + " playback_queue: asyncio.Queue = asyncio.Queue()\n", + " shared_state: dict = {\n", + " \"mute_mic\": False,\n", + " \"input_transcripts\": deque(),\n", + " \"pending_transcription_prints\": deque(),\n", + " }\n", + "\n", + " async with websockets.connect(\n", + " ws_url, additional_headers=headers, max_size=None\n", + " ) as ws:\n", + " await ws.send(json.dumps(session_update_payload))\n", + "\n", + " listener_task = asyncio.create_task(\n", + " listen_for_events(\n", + " ws,\n", + " stop_event=stop_event,\n", + " transcription_instructions=transcription_instructions,\n", + " max_turns=max_turns,\n", + " playback_queue=playback_queue,\n", + " shared_state=shared_state,\n", + " )\n", + " )\n", + " mic_task = asyncio.create_task(\n", + " stream_microphone_audio(ws, stop_event, shared_state=shared_state)\n", + " )\n", + " playback_task = asyncio.create_task(playback_audio(playback_queue, stop_event))\n", + "\n", + " try:\n", + " if timeout_seconds and timeout_seconds > 0:\n", + " await asyncio.wait_for(stop_event.wait(), timeout=timeout_seconds)\n", + " else:\n", + " await stop_event.wait()\n", + " except asyncio.TimeoutError:\n", + " print(\"Timed out waiting for responses; closing.\")\n", + " except asyncio.CancelledError:\n", + " print(\"Session cancelled; closing.\")\n", + " finally:\n", + " stop_event.set()\n", + " await playback_queue.put(None)\n", + " await ws.close()\n", + " await asyncio.gather(\n", + " listener_task, mic_task, playback_task, return_exceptions=True\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "c9a2a33b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Streaming microphone audio at 24000 Hz (mono). 
Speak naturally; server VAD will stop listening when you pause.\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Hello.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Hello\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Hello! Let's get started with your claim. Can you tell me your full name, please?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My name is M I N H A J U L H O Q U E\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My name is Minhajul Hoque.\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you. Just to confirm, I heard your full name as Minhajul Hoque. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yep.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Yep.\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Great, thank you for confirming. Now, could you provide your policy number, please?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My policy number is X077-B025.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My policy number is X077B025.\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Thank you. Let me confirm: I have your policy number as X077B025. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== Assistant response ===\n", + "Of course. Your full name is Minhajul Hoque. Now, let’s move on. What type of accident are you reporting—auto, home, or something else?\n", + "\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "Yeah, can you ask me my name again?\n", + "\n", + "=== User turn (Transcription model) ===\n", + "Can you ask me my name again?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "No, can you ask me my name again, this is important.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "No, can you ask me by name again?\n", + "\n", + "\n", + "=== Assistant response ===\n", + "Understood. Let me repeat your full name again to confirm. Your name is Minhajul Hoque. Is that correct?\n", + "\n", + "\n", + "[client] Speech detected; streaming...\n", + "[client] Detected silence; preparing transcript...\n", + "\n", + "=== User turn (Realtime transcript) ===\n", + "My name is Minhajul Hoque.\n", + "\n", + "=== User turn (Transcription model) ===\n", + "My name is Minhaj ul Haq.\n", + "\n", + "Session cancelled; closing.\n" + ] + } + ], + "source": [ + "await run_realtime_session()" + ] + }, + { + "cell_type": "markdown", + "id": "efabdbf5", + "metadata": {}, + "source": [ + "From the above example, we can notice:\n", + "- The Realtime Model Transcription quality matches or surpasses that of the transcription model in various turns. 
In one of the turns, the transcription model misses \"this is important.\" while the realtime transcription captures it correctly.\n",
+ "- The realtime model correctly applies rules for Policy Number formatting (XXXX-XXXX).\n",
+ "- With context from the entire session, including previous turns where I spelled out my name, the realtime model accurately transcribes my name when the assistant asks for it again, while the transcription model makes errors (e.g., \"Minhaj ul Haq\").\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6d8ac6e3",
+ "metadata": {},
+ "source": [
+ "# Conclusion\n",
+ "\n",
+ "Exploring **out-of-band transcription** could be beneficial for your use case if:\n",
+ "\n",
+ "* You're still experiencing unreliable transcriptions, even after optimizing the transcription model prompt.\n",
+ "* You need a more reliable and steerable method for generating transcriptions.\n",
+ "* The current transcripts fail to normalize entities correctly, causing downstream issues.\n",
+ "\n",
+ "If you decide to pursue this method, make sure you:\n",
+ "\n",
+ "* Set up the transcription trigger correctly, ensuring it activates after the audio commit.\n",
+ "* Carefully iterate on and refine the prompt to align it closely with your specific use case and needs.\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "openai",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/examples/Realtime_prompting_guide.ipynb b/examples/Realtime_prompting_guide.ipynb
index fdbb9d4418..fa5dd7909f 100644
--- a/examples/Realtime_prompting_guide.ipynb
+++ b/examples/Realtime_prompting_guide.ipynb
@@ -993,6 +993,32 @@
 "In this example, the model asks for clarification after my *(very)* loud cough and unclear audio."
 ]
 },
+ {
+ "cell_type": "markdown",
+ "id": "e0ce8f30",
+ "metadata": {},
+ "source": [
+ "## Background Music or Sounds\n",
+ "Occasionally, the model may generate unintended background music, humming, rhythmic noises, or sound-like artifacts during speech generation. These artifacts can diminish clarity, distract users, or make the assistant feel less professional. The following instruction helps prevent or significantly reduce these occurrences.\n",
+ "\n",
+ "- **When to use**: Use when you observe unintended musical elements or sound effects in Realtime audio responses.\n",
+ "- **What it does**: Steers the model to avoid generating these unwanted audio artifacts.\n",
+ "- **How to adapt**: Adjust the instruction to explicitly suppress the specific sound patterns you are encountering."
+ ] + }, + { + "cell_type": "markdown", + "id": "c22c1c32", + "metadata": {}, + "source": [ + "### Example\n", + "```\n", + "# Instructions/Rules\n", + "...\n", + "- Do not include any sound effects or onomatopoeic expressions in your responses.\n", + "```" + ] + }, { "cell_type": "markdown", "id": "ea96cb72", diff --git a/images/oob_transcription.png b/images/oob_transcription.png new file mode 100644 index 0000000000..35503af901 Binary files /dev/null and b/images/oob_transcription.png differ diff --git a/registry.yaml b/registry.yaml index d1207b571e..bfebed7d39 100644 --- a/registry.yaml +++ b/registry.yaml @@ -4,6 +4,18 @@ # should build pages for, and indicates metadata such as tags, creation date and # authors for each page. +- title: Transcribing User Audio with a Separate Realtime Request + path: examples/Realtime_out_of_band_transcription.ipynb + date: 2025-11-20 + authors: + - minh-hoque + tags: + - realtime + - transcription + - voice + - speech + - audio + - title: Self-Evolving Agents - A Cookbook for Autonomous Agent Retraining path: examples/partners/self_evolving_agents/autonomous_agent_retraining.ipynb date: 2025-11-04