# Using Speech Generation (Text-to-Speech) with Gemini

This notebook demonstrates how to use the Gemini API to generate speech from text. The API supports both single-speaker and multi-speaker audio, and you control style, accent, pace, and tone through natural-language instructions in the prompt.

Text-to-speech (TTS) generation is currently in Preview.

**Supported Models:**
*   `gemini-2.5-flash-preview-tts`
*   `gemini-2.5-pro-preview-tts`

For more details, see the [official documentation](https://ai.google.dev/gemini-api/docs/speech-generation).


In [None]:
%pip install google-genai soundfile numpy

In [8]:
import os
from google import genai
from google.genai import types
import soundfile as sf
import numpy as np
from IPython.display import Audio, display

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Helper that saves raw 16-bit PCM audio bytes as a WAV file
def save_audio_with_soundfile(pcm_byte_data, filename="output.wav", sample_rate=24000):
    """Saves PCM byte data to a WAV file using the soundfile library."""
    audio_array = np.frombuffer(pcm_byte_data, dtype=np.int16)
    sf.write(filename, audio_array, sample_rate)
    return filename
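
If installing `soundfile` and `numpy` is not an option, a similar WAV file can be written with Python's standard-library `wave` module. The sketch below assumes the same format the helper above implies: 16-bit mono PCM at 24 kHz. The names `save_audio_with_wave` and `pcm_duration_seconds` are illustrative and not part of any SDK.

```python
import wave


def save_audio_with_wave(pcm_byte_data, filename="output_wave.wav", sample_rate=24000):
    """Write raw 16-bit mono PCM bytes to a WAV file using only the stdlib."""
    with wave.open(filename, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono (assumed, matching the 1-D array above)
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_byte_data)
    return filename


def pcm_duration_seconds(pcm_byte_data, sample_rate=24000, sample_width=2, channels=1):
    """Return the playback length of raw PCM data in seconds."""
    return len(pcm_byte_data) / (sample_rate * sample_width * channels)
```

Under those assumptions, `save_audio_with_wave(data)` can stand in for `save_audio_with_soundfile(data)`, and `pcm_duration_seconds(data)` gives a quick sanity check on how much audio came back.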


## Single-speaker text-to-speech

This example demonstrates how to convert text to single-speaker audio. We'll set `response_modalities` to `["AUDIO"]` and provide a `SpeechConfig` whose `PrebuiltVoiceConfig` selects a voice through `voice_name`.

In [19]:
single_speaker_prompt = """
Say cheerfully "Have a wonderful day! I wish you a great day!"
[Pause]
Say angrily "What the hell was that?"
"""

response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents=single_speaker_prompt,
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
               voice_name='Kore', # Choose from https://ai.google.dev/gemini-api/docs/speech-generation#voices
            )
         )
      ),
   )
)

# save as file and display
filename = save_audio_with_soundfile(response.candidates[0].content.parts[0].inline_data.data)
display(Audio(filename))
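
The cell above indexes straight into `response.candidates[0].content.parts[0].inline_data.data`, which fails with an unhelpful error if any link in that chain is missing. A minimal defensive sketch follows; it assumes that a response can come back without audio (the notebook itself does not say so), and `extract_audio_bytes` is a hypothetical helper, not an SDK function. A stand-in object is used so the example runs without an API call.

```python
from types import SimpleNamespace


def extract_audio_bytes(response):
    """Return the first inline audio payload in a response, or raise ValueError."""
    for candidate in getattr(response, "candidates", None) or []:
        content = getattr(candidate, "content", None)
        for part in getattr(content, "parts", None) or []:
            inline_data = getattr(part, "inline_data", None)
            if inline_data is not None and inline_data.data:
                return inline_data.data
    raise ValueError("Response contains no inline audio data")


# Stand-in object with the same attribute path as the response used above.
fake_part = SimpleNamespace(inline_data=SimpleNamespace(data=b"\x01\x00"))
fake_response = SimpleNamespace(
    candidates=[SimpleNamespace(content=SimpleNamespace(parts=[fake_part]))]
)
assert extract_audio_bytes(fake_response) == b"\x01\x00"
```

With a real response, `save_audio_with_soundfile(extract_audio_bytes(response))` would replace the direct attribute access.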


## Multi-speaker text-to-speech

This example shows how to generate audio with multiple speakers. We provide a transcript with speaker names and can guide the style for each speaker.
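
The transcript format described above (a style instruction followed by `Name: line` turns) can also be assembled programmatically. The sketch below is one way to do that for a two-speaker conversation; `build_multi_speaker_prompt` and its parameters are hypothetical names chosen for illustration, and the header wording simply mirrors the prompt used in this notebook.

```python
def build_multi_speaker_prompt(turns, styles=None):
    """Build a transcript prompt from (speaker, line) turns and optional style notes."""
    # Unique speaker names in order of first appearance.
    speakers = list(dict.fromkeys(speaker for speaker, _ in turns))
    header = "Conversation between " + " and ".join(speakers) + "."
    if styles:
        notes = ", and ".join(f"{name} sound {style}" for name, style in styles.items())
        header += f" Make {notes}:"
    body = "\n".join(f"{speaker}: {line}" for speaker, line in turns)
    return f"{header}\n\n{body}\n"


example_prompt = build_multi_speaker_prompt(
    turns=[
        ("Joe", "So... what's on the agenda today?"),
        ("Sarah", "You're never going to guess!"),
    ],
    styles={"Joe": "tired and bored", "Sarah": "excited and happy"},
)
print(example_prompt)
```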

In [20]:
multi_speaker_prompt = """Conversation between Joe and Sarah. Make Joe sound tired and bored, and Sarah sound excited and happy:

Joe: So... what's on the agenda today?
Sarah: You're never going to guess!
"""

response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents=multi_speaker_prompt,
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=[
               types.SpeakerVoiceConfig(
                  speaker='Joe',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Enceladus',
                     )
                  )
               ),
               types.SpeakerVoiceConfig(
                  speaker='Sarah',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Leda', 
                     )
                  )
               ),
            ]
         )
      )
   )
)

# save as file and display
filename = save_audio_with_soundfile(response.candidates[0].content.parts[0].inline_data.data)
display(Audio(filename))
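
In the example above, each `speaker` value in `SpeakerVoiceConfig` is the same name used as a label in the transcript. Assuming that match is required for a voice to be applied (an assumption worth checking against the official documentation), a small pre-flight check can catch typos before the request is sent. `find_unconfigured_speakers` is a hypothetical helper, and it assumes single-word speaker names written as `Name: text` at the start of a line.

```python
import re


def find_unconfigured_speakers(prompt, configured_speakers):
    """Return transcript speaker labels that have no entry in configured_speakers."""
    # A speaker turn is assumed to look like "Name: text" at the start of a line.
    labels = re.findall(r"^(\w+):[ \t]+\S", prompt, flags=re.MULTILINE)
    return sorted(set(labels) - set(configured_speakers))


sample_prompt = """Conversation between Joe and Sarah. Make Joe sound tired and bored, and Sarah sound excited and happy:

Joe: So... what's on the agenda today?
Sarah: You're never going to guess!
"""

missing = find_unconfigured_speakers(sample_prompt, ["Joe"])
print(missing)
```

An empty list means every labeled speaker in the transcript has a configured voice.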