# 3.4 OpenAI API Deep Dive — Audio & Speech APIs (Whisper STT & TTS)

## Playground Notebook

OpenAI provides two audio capabilities that form a complete voice pipeline:

| API | Direction | What It Does |
|-----|-----------|-------------|
| **Whisper (STT)** | Audio → Text | Transcribe or translate spoken audio |
| **TTS** | Text → Audio | Convert text into lifelike spoken audio |

Together: **Voice In → LLM Processing → Voice Out**

> **Model:** `gpt-4o-mini` for text processing, `whisper-1` for STT, `tts-1` for TTS.

---

In [1]:
import os
import time
from pathlib import Path
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import display, Markdown, HTML, Audio

load_dotenv()

MODEL = "gpt-4o-mini"
client = OpenAI()

print(f"\u2705 Client ready")

✅ Client ready


In [2]:
# ============================================================
#  HELPER FUNCTIONS
# ============================================================

def chat(messages, max_tokens=150, **kwargs):
    """Send messages to OpenAI and display the response."""
    start = time.time()
    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        max_tokens=max_tokens,
        **kwargs
    )
    elapsed = time.time() - start
    content = response.choices[0].message.content
    display(Markdown(content))
    print(f"\n\u23f1\ufe0f {elapsed:.2f}s | Tokens: {response.usage.prompt_tokens}+{response.usage.completion_tokens}={response.usage.total_tokens}")
    return response


def show_messages(messages):
    """Pretty-print the message list being sent (handles both dicts and OpenAI objects)."""
    colors = {"system": "#e74c3c", "user": "#3498db", "assistant": "#2ecc71", "tool": "#f39c12"}
    html = ""
    for msg in messages:
        if isinstance(msg, dict):
            role = msg.get("role", "unknown")
            content = msg.get("content", "")
        else:
            role = getattr(msg, "role", "unknown")
            content = getattr(msg, "content", None)
        if not content:
            tool_calls = msg.get("tool_calls", None) if isinstance(msg, dict) else getattr(msg, "tool_calls", None)
            if tool_calls:
                content = ", ".join(f"{tc.function.name}({tc.function.arguments})" for tc in tool_calls)
                content = f"[tool_calls] {content}"
            else:
                content = "(empty)"
        if len(str(content)) > 200:
            content = str(content)[:200] + "..."
        color = colors.get(role, "#888")
        html += (
            f'<div style="margin:6px 0;padding:8px 12px;border-left:4px solid {color};'
            f'background:#1e1e1e;border-radius:4px;">'
            f'<strong style="color:{color};text-transform:uppercase;">{role}</strong>'
            f'<br><span style="color:#ccc;">{content}</span></div>'
        )
    display(HTML(html))


print("\u2705 Helpers loaded")

✅ Helpers loaded


---

## 1. Text-to-Speech (TTS) — Generating Audio

We start with TTS so we can **create audio files** to use in the Whisper (STT) experiments later.

### Available Models & Voices

| Model | Quality | Speed | Cost |
|-------|---------|-------|------|
| `tts-1` | Good | Fast | $15/1M chars |
| `tts-1-hd` | Best | Slower | $30/1M chars |

| Voice | Character |
|-------|-----------|
| `alloy` | Neutral, balanced |
| `echo` | Warm, conversational |
| `fable` | Expressive, British |
| `onyx` | Deep, authoritative |
| `nova` | Friendly, youthful |
| `shimmer` | Soft, gentle |
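
The per-character prices in the model table can be turned into a quick budgeting estimate. The following is a minimal sketch: `estimate_tts_cost` and `TTS_PRICE_PER_MILLION_CHARS` are hypothetical names, and the prices are copied from the table above, so they may not match current billing.

```python
# Prices in USD per 1M input characters, copied from the model table above.
# They are assumptions for illustration; check current pricing before relying on them.
TTS_PRICE_PER_MILLION_CHARS = {"tts-1": 15.0, "tts-1-hd": 30.0}


def estimate_tts_cost(text: str, model: str = "tts-1") -> float:
    """Return the estimated USD cost of synthesizing `text` with `model`."""
    if model not in TTS_PRICE_PER_MILLION_CHARS:
        raise ValueError(f"Unknown TTS model: {model!r}")
    return len(text) * TTS_PRICE_PER_MILLION_CHARS[model] / 1_000_000


sample = "Hello! Welcome to the OpenAI Audio API playground."
for model in TTS_PRICE_PER_MILLION_CHARS:
    print(f"{model:9s} ${estimate_tts_cost(sample, model):.6f}")
```

Short prompts like the ones in this notebook cost a small fraction of a cent; the estimate matters mostly for batch jobs such as narrating long documents.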

### Experiment 1A: Basic TTS — Generate Speech

In [3]:
# Generate speech from text
start = time.time()

response = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input="Hello! Welcome to the OpenAI Audio API playground."
)

# Save to file
speech_file = "tts_demo.mp3"
response.stream_to_file(speech_file)

elapsed = time.time() - start
file_size = os.path.getsize(speech_file)

print(f"\u2705 Generated: {speech_file}")
print(f"   Size: {file_size:,} bytes")
print(f"   Time: {elapsed:.2f}s")

# Play in notebook
display(Audio(speech_file))

✅ Generated: tts_demo.mp3
   Size: 55,680 bytes
   Time: 4.51s
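
Recent versions of the `openai` Python package flag `stream_to_file()` on a regular (non-streaming) response as deprecated and point to `with_streaming_response` instead. A possible replacement is sketched below; `save_speech` is a hypothetical helper name, and the client is passed in as a parameter so the function can be tried against any object that exposes the same call chain.

```python
from pathlib import Path


def save_speech(client, text, path, model="tts-1", voice="nova"):
    """Synthesize `text` and stream the audio to `path`.

    `client` is expected to be an `openai.OpenAI` instance. The streaming
    variant writes chunks to disk as they arrive instead of buffering the
    whole response first.
    """
    path = Path(path)
    with client.audio.speech.with_streaming_response.create(
        model=model, voice=voice, input=text
    ) as response:
        response.stream_to_file(path)
    return path


# Usage (requires an API key, so it is left commented out here):
# from openai import OpenAI
# save_speech(OpenAI(), "Hello! Welcome to the OpenAI Audio API playground.", "tts_demo.mp3")
```

The rest of this notebook keeps the shorter `response.stream_to_file(...)` form, which still works but may print a deprecation warning.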



### Experiment 1B: Compare Different Voices

In [4]:
text = "The quick brown fox jumps over the lazy dog."
voices = ["alloy", "echo"]

for voice in voices:
    print(f"\n{'=' * 40}")
    print(f"  Voice: {voice}")
    print(f"{'=' * 40}")

    resp = client.audio.speech.create(
        model="tts-1", voice=voice, input=text
    )
    fname = f"tts_{voice}.mp3"
    resp.stream_to_file(fname)
    display(Audio(fname))


  Voice: alloy




  Voice: echo


### Experiment 1C: Output Formats

| Format | Use Case |
|--------|----------|
| `mp3` | Default, good compression |
| `opus` | Streaming, low latency |
| `aac` | Apple devices |
| `flac` | Lossless, archival |
| `wav` | Uncompressed, editing |
| `pcm` | Raw audio bytes |

In [5]:
# Compare file sizes across formats
text = "This is a test of different audio formats."

for fmt in ["mp3", "aac", "flac"]:
    resp = client.audio.speech.create(
        model="tts-1", voice="alloy", input=text,
        response_format=fmt
    )
    fname = f"tts_format.{fmt}"
    resp.stream_to_file(fname)
    size = os.path.getsize(fname)
    print(f"  {fmt:5s} \u2192 {size:>8,} bytes")

print("\n\u2139\ufe0f mp3 is best for general use; flac for lossless quality.")


  mp3   →   53,280 bytes
  aac   →   18,663 bytes
  flac  →   74,411 bytes

ℹ️ mp3 is best for general use; flac for lossless quality.
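
The `pcm` row in the format table is the only option without a container, so most players cannot open it directly. One way to handle it is to wrap the raw samples in a WAV header with the standard-library `wave` module. This is a sketch under stated assumptions: `pcm_to_wav` is a hypothetical helper, and the 24 kHz, 16-bit, mono layout is what OpenAI's documentation describes for PCM output at the time of writing, so verify it before relying on it.

```python
import wave


def pcm_to_wav(pcm_bytes: bytes, wav_path: str, sample_rate: int = 24_000) -> None:
    """Wrap raw PCM samples in a WAV container so ordinary players can open them.

    Assumes 16-bit signed little-endian mono samples; adjust the parameters
    if the audio you receive uses a different layout.
    """
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)


# Self-contained demo: one second of silence instead of a real API response.
silence = b"\x00\x00" * 24_000
pcm_to_wav(silence, "pcm_demo.wav")

with wave.open("pcm_demo.wav", "rb") as check:
    print(check.getnframes() / check.getframerate(), "seconds")
```

With a real response, the bytes passed to `pcm_to_wav` would come from a `client.audio.speech.create(..., response_format="pcm")` call.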


### Experiment 1D: Speed Control

In [6]:
text = "Speed can be adjusted between zero point two five and four point zero."

for speed in [0.75, 1.0, 1.5]:
    print(f"\n--- Speed: {speed}x ---")
    resp = client.audio.speech.create(
        model="tts-1", voice="nova", input=text, speed=speed
    )
    fname = f"tts_speed_{speed}.mp3"
    resp.stream_to_file(fname)
    display(Audio(fname))


--- Speed: 0.75x ---




--- Speed: 1.0x ---



--- Speed: 1.5x ---
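
When the speed comes from user input, it can help to clamp it to the 0.25 to 4.0 range mentioned in this section before sending the request. A minimal sketch, with `clamp_speed` as a hypothetical helper name:

```python
MIN_SPEED, MAX_SPEED = 0.25, 4.0  # range stated in this section


def clamp_speed(speed: float) -> float:
    """Clamp a requested playback speed into the supported range."""
    return max(MIN_SPEED, min(MAX_SPEED, float(speed)))


for requested in [0.1, 0.75, 1.0, 5.0]:
    print(f"requested {requested} -> using {clamp_speed(requested)}")
```

Clamping silently is one design choice; raising an error for out-of-range values may be preferable when the caller should know the request was adjusted.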


---

## 2. Speech-to-Text (Whisper) — Transcription

Whisper is OpenAI's speech recognition model. It supports two modes:

| Mode | What It Does |
|------|-------------|
| **Transcription** | Audio → Text (same language) |
| **Translation** | Audio (any language) → English text |

### Experiment 2A: Transcribe the Audio We Generated

In [7]:
# Transcribe the TTS file we created earlier
print("Transcribing tts_demo.mp3...\n")

start = time.time()
with open("tts_demo.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )
elapsed = time.time() - start

print(f"\u2705 Transcription: {transcript.text}")
print(f"\u23f1\ufe0f Time: {elapsed:.2f}s")

Transcribing tts_demo.mp3...

✅ Transcription: Hello, welcome to the OpenAI Audio API Playground.
⏱️ Time: 0.82s
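
The transcript differs from the TTS input only in punctuation and capitalization. To check a round trip like this automatically, the two strings can be normalized before comparing. The sketch below uses only the standard library; `normalize` and `word_similarity` are hypothetical helper names, and the two strings are copied from the cells above.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def word_similarity(expected: str, actual: str) -> float:
    """Return a 0..1 similarity score over the normalized word sequences."""
    return SequenceMatcher(None, normalize(expected), normalize(actual)).ratio()


tts_input = "Hello! Welcome to the OpenAI Audio API playground."
stt_output = "Hello, welcome to the OpenAI Audio API Playground."

print(normalize(tts_input) == normalize(stt_output))
print(word_similarity(tts_input, stt_output))
```

A similarity score is more forgiving than strict equality, which is useful for longer recordings where a few words may be misheard.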


### Experiment 2B: Transcription Parameters

In [8]:
# Generate a longer audio for testing parameters
resp = client.audio.speech.create(
    model="tts-1", voice="alloy",
    input="Python is a programming language created by Guido van Rossum. It was first released in 1991."
)
resp.stream_to_file("whisper_test.mp3")

# Transcribe with parameters
with open("whisper_test.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        language="en",               # hint the language for better accuracy
        prompt="Python, Guido van Rossum",  # hint for proper nouns
        temperature=0.0              # 0 = most deterministic; higher values add randomness
    )

print(f"\u2705 Result: {result.text}")


✅ Result: Python is a programming language created by Guido van Rossum. It was first released in 1991.


### Experiment 2C: Output Formats — SRT Subtitles & Verbose JSON

In [9]:
# SRT format — ready for subtitles
with open("whisper_test.mp3", "rb") as f:
    srt_result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="srt"
    )

print("SRT (Subtitle) Format:")
print("-" * 40)
print(srt_result)

print("\n" + "=" * 50)

# Verbose JSON — with timestamps
with open("whisper_test.mp3", "rb") as f:
    verbose = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json"
    )

print("\nVerbose JSON Format:")
print("-" * 40)
print(f"Language: {verbose.language}")
print(f"Duration: {verbose.duration}s")
for seg in verbose.segments:
    # TranscriptionSegment object has attributes rather than dict keys
    start = getattr(seg, 'start', None) if not isinstance(seg, dict) else seg.get('start')
    end = getattr(seg, 'end', None) if not isinstance(seg, dict) else seg.get('end')
    text = getattr(seg, 'text', None) if not isinstance(seg, dict) else seg.get('text', '')
    # guard against None values
    if start is None or end is None:
        print(f"  [??] {text}")
    else:
        print(f"  [{start:.1f}s - {end:.1f}s] {text}")

SRT (Subtitle) Format:
----------------------------------------
1
00:00:00,000 --> 00:00:03,800
Python is a programming language created by Guido van Rossum.

2
00:00:03,800 --> 00:00:06,240
It was first released in 1991.





Verbose JSON Format:
----------------------------------------
Language: english
Duration: 6.519999980926514s
  [0.0s - 3.8s]  Python is a programming language created by Guido van Rossum.
  [3.8s - 6.2s]  It was first released in 1991.
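
The segment timings in `verbose_json` are enough to build subtitle files by hand, which can be useful when segments need to be merged or edited before export. The sketch below formats (start, end, text) tuples as WebVTT; `vtt_timestamp` and `segments_to_vtt` are hypothetical helpers, and the sample timings are copied from the SRT output above. For unmodified subtitles, requesting `response_format="vtt"` directly is simpler.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments) -> str:
    """Build a WebVTT document from (start, end, text) tuples."""
    lines = ["WEBVTT", ""]
    for start, end, text in segments:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text.strip())
        lines.append("")
    return "\n".join(lines)


# Segment times taken from the SRT output above.
segments = [
    (0.0, 3.8, " Python is a programming language created by Guido van Rossum."),
    (3.8, 6.24, " It was first released in 1991."),
]
print(segments_to_vtt(segments))
```

With real API output, the tuples could be built from each segment's `start`, `end`, and `text` fields.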


### Experiment 2D: Translation (Any Language → English)

In [10]:
# Generate French test audio with TTS (pronunciation may sound accented)
resp = client.audio.speech.create(
    model="tts-1", voice="nova",
    input="Bonjour! Je suis un assistant vocal."
)
resp.stream_to_file("french_test.mp3")
display(Audio("french_test.mp3"))

# Translate to English
with open("french_test.mp3", "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=f
    )

print(f"\n\u2705 Translation (\u2192 English): {translation.text}")



✅ Translation (→ English): Bonjour, je suis un assistant vocal.


---

## 3. Full Voice Pipeline — Whisper + GPT + TTS

Combine all three to build a complete **voice-in, voice-out** assistant.

In [11]:
def voice_assistant(audio_path):
    """Complete voice pipeline: STT -> LLM -> TTS"""
    print("\u250c\u2500 VOICE PIPELINE \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510")

    # Step 1: STT — Audio to Text
    t1 = time.time()
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    stt_time = time.time() - t1
    print(f"\u2502 1. STT:  \"{transcript.text}\" ({stt_time:.2f}s)")

    # Step 2: LLM — Process with GPT
    messages = [
        {"role": "system", "content": "You are a helpful voice assistant. Reply in 1-2 short sentences."},
        {"role": "user", "content": transcript.text}
    ]
    print(f"\u2502 2. LLM:")
    show_messages(messages)

    t2 = time.time()
    response = chat(messages, max_tokens=60)
    reply = response.choices[0].message.content
    llm_time = time.time() - t2

    # Step 3: TTS — Text to Audio
    t3 = time.time()
    speech = client.audio.speech.create(model="tts-1", voice="nova", input=reply)
    speech.stream_to_file("pipeline_output.mp3")
    tts_time = time.time() - t3
    print(f"\u2502 3. TTS:  pipeline_output.mp3 ({tts_time:.2f}s)")

    total = stt_time + llm_time + tts_time
    print(f"\u2502")
    print(f"\u2502 Total: {total:.2f}s")
    print(f"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518")
    return reply


# Create a user audio input
print("=" * 60)
print("  Full Voice Pipeline: STT \u2192 GPT \u2192 TTS")
print("=" * 60)

resp = client.audio.speech.create(
    model="tts-1", voice="echo",
    input="What is the tallest mountain in the world?"
)
resp.stream_to_file("user_question.mp3")

print("\nUser audio:")
display(Audio("user_question.mp3"))
print()

# Run the full pipeline
_ = voice_assistant("user_question.mp3")

print("\nAssistant audio:")
display(Audio("pipeline_output.mp3"))

  Full Voice Pipeline: STT → GPT → TTS

User audio:




┌─ VOICE PIPELINE ────────────────────────────────┐
│ 1. STT:  "What is the tallest mountain in the world?" (1.46s)
│ 2. LLM:


The tallest mountain in the world is Mount Everest, which stands at 8,848.86 meters (29,031.7 feet) above sea level.


⏱️ 1.09s | Tokens: 36+32=68
│ 3. TTS:  pipeline_output.mp3 (3.38s)
│
│ Total: 5.94s
└───────────────────────────────────────────────┘

Assistant audio:
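
The per-stage timings printed by the pipeline show where the latency goes. A small sketch that ranks the stages by their share of the total follows; `latency_breakdown` is a hypothetical helper, and the numbers are copied from the single run above, so they will vary between runs.

```python
def latency_breakdown(timings: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (stage, seconds, share_of_total) sorted from slowest to fastest."""
    total = sum(timings.values())
    rows = [(stage, secs, secs / total) for stage, secs in timings.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Timings copied from the pipeline run above.
run = {"stt": 1.46, "llm": 1.09, "tts": 3.38}
for stage, secs, share in latency_breakdown(run):
    print(f"{stage.upper():4s} {secs:5.2f}s  {share:6.1%}")
```

In this run the TTS step takes the largest share, which suggests that streaming the synthesized audio would likely reduce perceived latency more than speeding up the other two stages.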



---

## Whisper Limits & Supported Formats

| Aspect | Details |
|--------|---------|
| **Max file size** | 25 MB per request |
| **Formats** | mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac |
| **Cost** | $0.006 per minute |
| **Large files** | Split into chunks before sending |
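
The limits in the table above can be turned into a simple plan for long recordings. The sketch below computes time ranges that should each stay under the upload limit, plus a cost estimate. `plan_chunks` and `estimate_stt_cost` are hypothetical helpers, the 25 MB limit is interpreted here as 25 × 1024 × 1024 bytes, and a roughly constant bitrate is assumed. Cutting the audio at those time ranges would still need an audio tool, which is not shown.

```python
import math

MAX_UPLOAD_BYTES = 25 * 1024 * 1024   # 25 MB limit from the table above
PRICE_PER_MINUTE = 0.006              # USD, from the table above


def plan_chunks(file_size: int, duration_s: float, safety: float = 0.9):
    """Return (start_s, end_s) time ranges whose chunks should fit under the limit.

    Assumes a roughly constant bitrate, so bytes scale linearly with time.
    `safety` leaves headroom for container overhead and bitrate variation.
    """
    budget = MAX_UPLOAD_BYTES * safety
    n_chunks = max(1, math.ceil(file_size / budget))
    step = duration_s / n_chunks
    return [(i * step, min(duration_s, (i + 1) * step)) for i in range(n_chunks)]


def estimate_stt_cost(duration_s: float) -> float:
    """Estimated USD cost of transcribing `duration_s` seconds of audio."""
    return duration_s / 60 * PRICE_PER_MINUTE


# Hypothetical example: a 60 MB recording that lasts one hour.
chunks = plan_chunks(file_size=60 * 1024 * 1024, duration_s=3600)
print(len(chunks), "chunks")
print(f"${estimate_stt_cost(3600):.2f}")
```

Splitting on time rather than raw bytes matters because slicing a compressed file at arbitrary byte offsets usually produces chunks that cannot be decoded.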

---

## Key Takeaways

| Concept | What to Remember |
|---------|------------------|
| **TTS** | `client.audio.speech.create()`: text → audio file |
| **6 voices** | alloy, echo, fable, onyx, nova, shimmer |
| **Speed** | 0.25x to 4.0x — adjust for use case |
| **Whisper STT** | `client.audio.transcriptions.create()`: audio → text |
| **Translation** | `client.audio.translations.create()`: any language → English |
| **Formats** | `json` (default), `text`, `srt`, `vtt`, `verbose_json`; the last three include timestamps |
| **Prompt hint** | Pass proper nouns to improve accuracy |
| **Voice Pipeline** | STT → GPT → TTS = complete voice assistant |