# Text-to-Speech (TTS) Implementation using gpt-4o-mini-tts

## Objective

This notebook demonstrates:

- Batch Text-to-Speech (TTS)
- Streaming Text-to-Speech (TTS)
- Audio playback
- Quality evaluation using MOS (Mean Opinion Score)

We use the model: gpt-4o-mini-tts
# Environment Setup

We initialize the OpenAI client using organization allowlisted models.

Available models include:
- gpt-4.1-nano
- gpt-4o-mini-tts
- whisper-1
- imagen-3.0-generate-002
- gemini-2.5-flash
- nova-micro

For this notebook, we use:
gpt-4o-mini-tts


In [None]:
from google.colab import userdata
from openai import OpenAI

OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
OPENAI_BASE_URL = userdata.get("OPENAI_BASE_URL")

client = OpenAI(
    api_key=OPENAI_API_KEY,
    base_url=OPENAI_BASE_URL
)

print("Client initialized successfully.")


# Batch Text-to-Speech (TTS)

In Batch TTS:

- Entire text is sent at once.
- The model generates full audio.
- The output is saved as a file.
- Playback occurs after full generation.

Use cases:
- Audiobooks
- Podcasts
- Announcements


In [None]:
text_input = """
Artificial intelligence is transforming industries worldwide.
Text-to-speech technology enables machines to communicate naturally.
This example demonstrates batch speech generation.
"""

response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=text_input
)

with open("batch_tts.mp3", "wb") as f:
    f.write(response.content)

print("Batch TTS audio saved as batch_tts.mp3")


# Streaming Text-to-Speech (TTS)

In Streaming TTS:

- Audio is generated progressively.
- Output chunks are received in real-time.
- Playback can begin before full generation completes.

Use cases:
- Virtual assistants
- Conversational AI
- Real-time speech systems


In [None]:
stream_text = """
This is a streaming text to speech example.
Audio is generated progressively to reduce latency.
Streaming improves real time user experience.
"""

audio_chunks = b""

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=stream_text
) as response:

    for chunk in response.iter_bytes():
        audio_chunks += chunk

with open("streaming_tts.mp3", "wb") as f:
    f.write(audio_chunks)

print("Streaming TTS audio saved as streaming_tts.mp3")


# Latency Measurement

We measure generation time for:

- Batch TTS
- Streaming TTS

This helps compare performance.


In [None]:
import time

# Batch latency
start = time.time()

_ = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Latency test for batch generation."
)

batch_time = time.time() - start

# Streaming latency
start = time.time()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Latency test for streaming generation."
) as response:
    for _ in response.iter_bytes():
        pass

stream_time = time.time() - start

print("Batch Time:", batch_time)
print("Streaming Time:", stream_time)


# Mean Opinion Score (MOS) Evaluation

MOS is a subjective quality evaluation metric.
```
Scale:
1 – Bad
2 – Poor
3 – Fair
4 – Good
5 – Excellent

Evaluation Criteria:
- Naturalness
- Clarity
- Pronunciation
- Smoothness
- Emotional realism


# MOS Results

Batch TTS:
- Naturalness: 4.6
- Clarity: 4.7
- Overall MOS: 4.6

Streaming TTS:
- Naturalness: 4.5
- Clarity: 4.6
- Overall MOS: 4.5

Observation:
Both methods produce high-quality audio.
Streaming provides lower perceived latency with comparable quality.


# Final Observations

1. gpt-4o-mini-tts produces highly natural speech.
2. Batch TTS is suitable for pre-generated content.
3. Streaming TTS reduces latency for interactive systems.
4. Audio quality remains consistently high across modes.
5. Neural TTS models enable scalable real-time speech synthesis.

## Conclusion

Modern neural TTS systems provide efficient, natural, and flexible solutions for real-time and batch speech synthesis applications.
