<a href="https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/t81_559_class_13_2_text2speech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-559: Applications of Generative Artificial Intelligence
**Module 13: Speech Processing**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 13 Material

Module 13: Speech Processing

* Part 13.1: Intro to Speech Processing [[Video]]() [[Notebook]](t81_559_class_13_1_speech_models.ipynb)
* **Part 13.2: Text to Speech** [[Video]]() [[Notebook]](t81_559_class_13_2_text2speech.ipynb)
* Part 13.3: Speech to Text [[Video]]() [[Notebook]](t81_559_class_13_3_speech2text.ipynb)
* Part 13.4: Speech Bot [[Video]]() [[Notebook]](t81_559_class_13_4_speechbot.ipynb)
* Part 13.5: Future Directions in GenAI [[Video]]() [[Notebook]](t81_559_class_13_5_future.ipynb)


# Google CoLab Instructions

The following code detects whether the notebook is running in Google CoLab. If it is, the code loads the OpenAI API key from the CoLab secrets and installs the libraries this part needs.

In [None]:
import os

try:
    # This import only succeeds inside Google CoLab
    from google.colab import drive, userdata
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# OpenAI Secrets
if COLAB:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Install needed libraries in CoLab
if COLAB:
    !pip install langchain langchain_openai openai pydub

Note: using Google CoLab
Collecting langchain
  Downloading langchain-0.3.3-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain_openai
  Downloading langchain_openai-0.2.2-py3-none-any.whl.metadata (2.6 kB)
Collecting openai
  Downloading openai-1.51.2-py3-none-any.whl.metadata (24 kB)
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting langchain-core<0.4.0,>=0.3.10 (from langchain)
  Downloading langchain_core-0.3.10-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langchain)
  Downloading langchain_text_splitters-0.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.132-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting tiktoken<1,>=0.7 (from langchain_openai)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_
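
The setup cell above only sets `OPENAI_API_KEY` when the notebook runs in CoLab. As a minimal sketch for other environments, the helper below checks that the key is present before any API call is made. The function name `require_api_key` is hypothetical and is not part of the OpenAI library; it only reads an environment mapping.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to the CoLab secrets or export it "
            "in your shell before running the cells below."
        )
    return key
```

Calling `require_api_key()` once near the top of the notebook turns a missing key into an immediate, readable error instead of a failure inside a later API request.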

# Part 13.2: Text to Speech

In this part, we explore text-to-speech (TTS) models, focusing on OpenAI's offerings. We primarily use OpenAI's TTS-1 model, which converts written text into natural-sounding speech. TTS-1 is optimized for real-time applications, making it well suited to scenarios that require low-latency audio generation. The model uses deep learning to produce lifelike speech. We will look at its capabilities and at practical applications ranging from accessibility tools to interactive voice response systems.


### Simple Text to Speech Example

This code snippet demonstrates how to use OpenAI's text-to-speech API to generate spoken audio from text. First, it imports openai for API interaction and IPython.display for audio playback in Jupyter notebooks (base64 is also imported, although this example does not use it). The TEXT variable contains the message to be converted to speech. The openai.audio.speech.create() function is called with three parameters: the model ("tts-1"), the voice ("alloy"), and the input text. OpenAI offers several voice options, including:

* **alloy** - neutral
* **echo** - young
* **fable** - male
* **onyx** - deep male
* **nova** - female
* **shimmer** - warm female

Each voice has its unique characteristics, allowing users to choose the most suitable one for their application. Additionally, OpenAI provides a high-definition model called "tts-1-hd" for enhanced audio quality, though it may have higher latency. The function returns a response object, from which the audio content is extracted and stored in the audio_data variable for further processing or playback.

In [None]:
import openai
import IPython.display as ipd
import base64

TEXT = "Hello there, I am one of the OpenAI chat voices, how are you?"

response = openai.audio.speech.create(
  model="tts-1",
  voice="alloy",
  input=TEXT
)

# Get the audio content
audio_data = response.content
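
The model and voice choices described above can be collected in a small helper. This is only a sketch: `build_speech_request` and `VOICES` are hypothetical names, and the voice list is limited to the six voices named in this notebook, so the API may accept others. The helper makes no network call; it only builds the keyword arguments.

```python
# The six voices listed in this notebook (the API may offer more).
VOICES = ("alloy", "echo", "fable", "onyx", "nova", "shimmer")


def build_speech_request(text, voice="alloy", high_definition=False):
    """Return keyword arguments for openai.audio.speech.create()."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    # "tts-1-hd" trades latency for quality; "tts-1" is the low-latency model.
    model = "tts-1-hd" if high_definition else "tts-1"
    return {"model": model, "voice": voice, "input": text}


params = build_speech_request("Hello there!", voice="nova", high_definition=True)
print(params)
# With an API key configured, the request could then be sent with:
#   response = openai.audio.speech.create(**params)
```

Validating the voice name locally catches a typo before any request is sent.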

We can play this audio directly in the CoLab notebook.

In [None]:
from IPython.display import Audio, display

# Play the audio in Colab
print("Playing audio:")
display(Audio(audio_data, autoplay=True))

Playing audio:


We can also save an audio file.

In [None]:
with open("audio.mp3", "wb") as f:
    f.write(audio_data)


We can download this audio file.

In [None]:
# Download the generated audio.mp3 file to the local machine (CoLab only)

from google.colab import files
files.download('audio.mp3')



### Multiple Voices and Samples

The following code concatenates multiple text-to-speech responses from OpenAI's API to showcase each of the available voices. The script iterates through a list of six voices ("alloy", "echo", "fable", "onyx", "nova", and "shimmer"), generating a sample for each voice that says "Hello, I am the [voice] voice." It then uses AudioSegment from the pydub library to decode each sample and append it to a single combined segment. The resulting audio plays each voice sample in sequence, so listeners can compare the voices in one continuous clip.

In [None]:
import io
from openai import OpenAI
from pydub import AudioSegment
from IPython.display import Audio, display
from google.colab import files

# Initialize OpenAI client (reads OPENAI_API_KEY from the environment)
client = OpenAI()

voices = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]
audio_segments = []

# Request one short MP3 sample per voice
for voice in voices:
    text = f"Hello, I am the {voice} voice."
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input=text
    )
    audio_segments.append(response.content)

# Combine audio segments: decode each MP3 and append it to one track
combined_audio = AudioSegment.empty()
for segment in audio_segments:
    audio = AudioSegment.from_mp3(io.BytesIO(segment))
    combined_audio += audio

# Convert the combined audio to a byte stream
buffer = io.BytesIO()
combined_audio.export(buffer, format="mp3")
buffer.seek(0)

0
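
To see what concatenation does without calling the API or installing pydub, the sketch below joins WAV clips using only the Python standard library. This is a different technique from the pydub code above, and it rests on stated assumptions: the clips are synthetic sine tones standing in for speech, and `make_tone`, `concat_wav`, and `SAMPLE_RATE` are hypothetical names. It has not been checked against WAV data returned by the OpenAI API, which may need extra handling.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24000  # assumed sample rate for the synthetic clips


def make_tone(freq_hz, seconds=0.2):
    """Return WAV bytes for a mono 16-bit sine tone (stand-in for a TTS clip)."""
    n_frames = int(SAMPLE_RATE * seconds)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)))
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(frames)
    return buf.getvalue()


def concat_wav(clips):
    """Join WAV byte strings that share the same channels, width, and rate."""
    if not clips:
        raise ValueError("at least one clip is required")
    out = io.BytesIO()
    params = None
    with wave.open(out, "wb") as writer:
        for clip in clips:
            with wave.open(io.BytesIO(clip), "rb") as reader:
                fmt = (reader.getnchannels(), reader.getsampwidth(), reader.getframerate())
                if params is None:
                    # The first clip defines the output format.
                    params = fmt
                    writer.setnchannels(fmt[0])
                    writer.setsampwidth(fmt[1])
                    writer.setframerate(fmt[2])
                elif fmt != params:
                    raise ValueError("all clips must share the same audio format")
                writer.writeframes(reader.readframes(reader.getnframes()))
    return out.getvalue()


combined = concat_wav([make_tone(440), make_tone(660)])
```

Uncompressed WAV frames can be appended directly, which is why no decoder is needed here. MP3 data must be decoded first, which is the work pydub (with ffmpeg) performs in the cell above.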

Play the combined audio in the CoLab notebook.

In [None]:
# Play the audio in Colab
print("Playing audio:")
display(Audio(buffer.read(), autoplay=True))

Playing audio:


Save the audio to a file.

In [None]:
# Reset buffer position
buffer.seek(0)

# Save the audio file
output_filename = "combined_voices.mp3"
with open(output_filename, "wb") as f:
    f.write(buffer.getvalue())

print(f"\nAudio saved as {output_filename}")


Audio saved as combined_voices.mp3


Download the audio file.

In [None]:
files.download(output_filename)
