<a href="https://colab.research.google.com/github/jonbaer/googlecolab/blob/master/YT_Gemini_TTS_Natural_Voice_AI_Studio_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U -q "google-genai>=1.16.1"

# Gemini TTS


### Set up your API key for AI Studio


In [None]:
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GOOGLE_AI_STUDIO')
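
The `google.colab` module is only importable inside Colab, so the cell above fails elsewhere. A minimal sketch of a fallback, assuming the key is also exported as an environment variable with the same name as the Colab secret (`load_api_key` and that naming choice are hypothetical, not part of the SDK):

```python
import os

def load_api_key(secret_name="GOOGLE_AI_STUDIO", env=os.environ):
    """Return the API key from Colab secrets, else from an environment variable."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        key = userdata.get(secret_name)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook.
        pass
    # Assumption of this sketch: the env var uses the same name as the Colab secret.
    key = env.get(secret_name)
    if not key:
        raise RuntimeError(f"No API key found in Colab secrets or ${secret_name}")
    return key
```

With this in place, `GOOGLE_API_KEY = load_api_key()` could replace the direct `userdata.get(...)` call.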

### Initialize SDK client


In [None]:
from google import genai
from google.genai import types

client = genai.Client(
    api_key=GOOGLE_API_KEY,
    )

### Import display and helper modules


In [None]:
from google import genai
from google.genai import types
from google.genai.types import GenerateContentConfig, Tool
from IPython.display import display, HTML, Markdown
import io
import json
import re

### Getting a list of models

In [None]:
for model in client.models.list(config={'query_base':True}):
    if 'tts' in model.name:
        print(model)

name='models/gemini-2.5-flash-preview-tts' display_name='Gemini 2.5 Flash Preview TTS' description='Gemini 2.5 Flash Preview TTS' version='gemini-2.5-flash-exp-tts-2025-05-19' endpoints=None labels=None tuned_model_info=TunedModelInfo(base_model=None, create_time=None, update_time=None) input_token_limit=32768 output_token_limit=8192 supported_actions=['countTokens', 'generateContent'] default_checkpoint_id=None checkpoints=None
name='models/gemini-2.5-pro-preview-tts' display_name='Gemini 2.5 Pro Preview TTS' description='Gemini 2.5 Pro Preview TTS' version='gemini-2.5-pro-preview-tts-2025-05-19' endpoints=None labels=None tuned_model_info=TunedModelInfo(base_model=None, create_time=None, update_time=None) input_token_limit=65536 output_token_limit=65536 supported_actions=['countTokens', 'generateContent'] default_checkpoint_id=None checkpoints=None


## Basic Text Generation

In [None]:
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="What is the origin of 'TTS'?"
)

Markdown(response.text)

"TTS" stands for **Text-to-Speech**.

Its origin isn't tied to a specific inventor or date, but rather a gradual evolution in technology. Here's a breakdown:

*   **Early Concepts:** The idea of machines speaking dates back centuries. Mechanical contraptions were created that could mimic human speech in a limited way.

*   **Modern Beginnings (Mid-20th Century):** The real groundwork for modern TTS was laid in the mid-20th century with the development of speech synthesis techniques using computers. Some key areas of focus were:

    *   **Formant Synthesis:** Creating speech by generating acoustic components (formants) that represent the resonant frequencies of the vocal tract.

    *   **Concatenative Synthesis:** Piecing together recorded snippets of human speech to form words and sentences.

*   **Continued Development:** From the late 20th century into the 21st, TTS systems have become much more sophisticated, driven by:

    *   **Increased Computing Power:** More complex algorithms and models became feasible.

    *   **Advances in Linguistics and Signal Processing:** Better understanding of speech production and perception.

    *   **Machine Learning (especially Deep Learning):** Neural networks have dramatically improved the naturalness and expressiveness of TTS.

**In summary:** TTS is not a single invention but a field that has evolved over time, drawing on advancements in computer science, linguistics, and engineering. The term "Text-to-Speech" itself likely emerged organically as the technology matured and became more widely used.


In [None]:
# response.candidates

## Basic TTS - Single Voice

In [None]:
from google import genai
from google.genai import types
import wave
import os
import base64
import struct


from IPython.display import Audio, display

In [None]:
# Write raw PCM bytes to a WAV file so the audio can be saved and played back.
def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2):
    print("\nWriting audio file with parameters:")
    print(f"Channels: {channels}")
    print(f"Sample rate: {rate}")
    print(f"Sample width: {sample_width}")
    print(f"Data length: {len(pcm)} bytes")

    with wave.open(filename, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)

In [None]:
PROMPT = "Say excitedly: That's right, Gemini now has text-to-speech!"

VOICE = 'Kore'

client = genai.Client(api_key=GOOGLE_API_KEY)

response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents=PROMPT,
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
               voice_name=VOICE,
            )
         )
      ),
   )
)

# Debug the response structure
print("\nResponse structure:")
print(f"Number of candidates: {len(response.candidates)}")
print(f"Content parts: {len(response.candidates[0].content.parts)}")
print(f"Part type: {type(response.candidates[0].content.parts[0])}")

data = response.candidates[0].content.parts[0].inline_data.data

# No base64 decoding is needed here: the SDK returns inline_data.data as raw PCM bytes.


Response structure:
Number of candidates: 1
Content parts: 1
Part type: <class 'google.genai.types.Part'>
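
Indexing `candidates[0].content.parts[0]` directly assumes the first part always carries audio. A more defensive sketch is below. The helper names are hypothetical, and the MIME type format `audio/L16;codec=pcm;rate=24000` is an assumption about what the TTS models return, so the parser falls back to 24000 Hz when no `rate` parameter is present. A `SimpleNamespace` stand-in replaces a real response so the block runs without an API call.

```python
from types import SimpleNamespace

def parse_sample_rate(mime_type, default=24000):
    """Read the rate=... parameter from a MIME type, else return `default`."""
    for param in (mime_type or "").split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "rate" and value.isdigit():
            return int(value)
    return default

def extract_audio(response):
    """Return (pcm_bytes, sample_rate) from the first part that carries inline audio."""
    for candidate in response.candidates or []:
        content = candidate.content
        for part in (content.parts if content and content.parts else []):
            blob = getattr(part, "inline_data", None)
            if blob is not None and blob.data:
                return blob.data, parse_sample_rate(blob.mime_type)
    raise ValueError("response contains no inline audio data")

# Stand-in object with the same attribute layout as the response printed above.
fake_blob = SimpleNamespace(data=b"\x00\x01", mime_type="audio/L16;codec=pcm;rate=24000")
fake_part = SimpleNamespace(inline_data=fake_blob)
fake_response = SimpleNamespace(
    candidates=[SimpleNamespace(content=SimpleNamespace(parts=[fake_part]))]
)
pcm, detected_rate = extract_audio(fake_response)
```

Raising on a missing audio part gives a clearer error than an `AttributeError` on `None` when a request is blocked or returns text only.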


In [None]:
response.usage_metadata

GenerateContentResponseUsageMetadata(cache_tokens_details=None, cached_content_token_count=None, candidates_token_count=95, candidates_tokens_details=[ModalityTokenCount(modality=<MediaModality.AUDIO: 'AUDIO'>, token_count=95)], prompt_token_count=13, prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=13)], thoughts_token_count=None, tool_use_prompt_token_count=None, tool_use_prompt_tokens_details=None, total_token_count=108, traffic_type=None)

In [None]:
rate = 24000
file_name = 'single_voice_out.wav'

print(f"\nSaving sample rate: {rate}")
wave_file(file_name, data, rate=rate)


Saving sample rate: 24000

Writing audio file with parameters:
Channels: 1
Sample rate: 24000
Sample width: 2
Data length: 182926 bytes
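
The byte count above translates directly into playback time, since uncompressed PCM stores `rate * channels * sample_width` bytes per second. A small sketch using the numbers from this run (182926 bytes, and the 95 audio tokens reported in `usage_metadata`); `pcm_duration_seconds` is a hypothetical helper name:

```python
def pcm_duration_seconds(num_bytes, rate=24000, channels=1, sample_width=2):
    """Duration of raw PCM audio: total bytes divided by bytes per second."""
    return num_bytes / (rate * channels * sample_width)

duration = pcm_duration_seconds(182926)   # data length printed above
tokens_per_second = 95 / duration         # candidates_token_count from usage_metadata
print(f"{duration:.2f} s, {tokens_per_second:.1f} audio tokens/s")
# prints: 3.81 s, 24.9 audio tokens/s
```

For this sample, the output works out to roughly 25 audio tokens per second of speech. That figure is an observation from a single run, not a documented rate.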


In [None]:
audio_file_path = file_name  # the file written above; avoids the Colab-only /content path
display(Audio(audio_file_path))


### Put it together as a function

In [None]:
def generate_tts(prompt, voice, file_name):
    """Synthesize `prompt` with the prebuilt `voice` and save it as `<file_name>.wav`."""
    client = genai.Client(api_key=GOOGLE_API_KEY)

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-tts",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=voice,
                    )
                )
            ),
        ),
    )

    # The audio arrives as raw PCM bytes in the first content part.
    data = response.candidates[0].content.parts[0].inline_data.data
    rate = 24000  # sample rate in Hz written to the WAV header
    file_name = f"{file_name}.wav"

    print(f"\nSaving sample rate: {rate}")
    wave_file(file_name, data, rate=rate)

    return file_name


In [None]:
PROMPT = "Whisper softly: That's right. Gemini now has text-to-speech!"
VOICE = 'Leda'
FILENAME = "leda_01"

audio_file_path = generate_tts(PROMPT, VOICE, FILENAME)


Saving sample rate: 24000

Writing audio file with parameters:
Channels: 1
Sample rate: 24000
Sample width: 2
Data length: 236686 bytes


In [None]:
display(Audio(audio_file_path))

In [None]:
PROMPT = "Laughing and giggling: That's right. Gemini now has text-to-speech!"
VOICE = 'Charon'
FILENAME = "charon_01"

audio_file_path = generate_tts(PROMPT, VOICE, FILENAME)


Saving sample rate: 24000

Writing audio file with parameters:
Channels: 1
Sample rate: 24000
Sample width: 2
Data length: 330766 bytes


In [None]:
display(Audio(audio_file_path))

In [None]:
PROMPT = "Sternly and angrily: No more excuses, you can now use Gemini TTS!"
VOICE = 'Charon'
FILENAME = "charon_02"

audio_file_path = generate_tts(PROMPT, VOICE, FILENAME)


Saving sample rate: 24000

Writing audio file with parameters:
Channels: 1
Sample rate: 24000
Sample width: 2
Data length: 282766 bytes


In [None]:
display(Audio(audio_file_path))
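
The cells above all follow the same pattern: a short delivery instruction, a colon, then the text to speak. A minimal sketch of a helper that builds such prompts so several style and voice combinations can be queued up; `styled_prompt`, the `jobs` list, and the output names are hypothetical and not part of the SDK:

```python
def styled_prompt(style, text):
    """Prefix `text` with a delivery instruction, e.g. 'whisper softly: ...'."""
    style = style.strip().rstrip(":").strip()
    text = text.strip()
    return f"{style}: {text}" if style else text

# Hypothetical batch of (prompt, voice, output name) combinations to try.
line = "No more excuses, you can now use Gemini TTS!"
jobs = [
    (styled_prompt("whisper softly", line), "Leda", "leda_02"),
    (styled_prompt("stern and angry", line), "Charon", "charon_03"),
]
for prompt, voice, name in jobs:
    print(f"{name}: [{voice}] {prompt}")
```

Each tuple could then be passed to `generate_tts(prompt, voice, name)` to produce one WAV file per combination.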

## Make a multi-speaker podcast

In [None]:
transcript = client.models.generate_content(
   model="gemini-2.0-flash",
   contents="""Generate a short transcript of around 200 words that reads
            like it was taken from a podcast featuring an expert on bringing back extinct animals (Jenny)
            and a podcast host (David). They are talking about Jenny's team bringing back the woolly mammoth.
            The presenters occasionally interrupt each other in their enthusiasm.
            The presenters' names are Jenny and David.""").text

print(f"Transcript: {transcript}")

Transcript: **(Intro music fades)**

**David:** Welcome back to "Rewilding the Planet," everyone! Today we have the incredible Dr. Jenny Chen, head of the Mammoth Revival Project! Jenny, welcome!

**Jenny:** Thanks, David! Thrilled to be here.

**David:** So, Jenny, the big question on everyone's minds: Woolly Mammoths. We’re actually talking about bringing them back! Where are we in the process?

**Jenny:** We're closer than people think! We’ve made significant progress in mapping the mammoth genome and identifying key differences from modern elephants.

**David:** Key differences that allow for, you know, surviving the ice age!

**Jenny:** Exactly! And with advancements in gene editing, we're working towards inserting those mammoth genes into elephant cells. We're focusing on traits like cold resistance, woolly hair…

**David:** (Excitedly interrupting) So, think of those thick shaggy coats we see in museums! We are going to have real life woolly mammoths!

**Jenny:** Well, it's not 
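
The generated transcript contains Markdown bold markers and stage directions such as `(Intro music fades)`, which the TTS model may read aloud or try to interpret. A rough sketch of a clean-up pass is below. It assumes you want plain `Speaker: text` lines only; `clean_transcript` is a hypothetical helper, and whether to strip inline directions is a judgment call, since they can also act as style cues.

```python
import re

def clean_transcript(text, speakers=("Jenny", "David")):
    """Keep only 'Speaker: text' lines, without Markdown bold or stage directions."""
    lines = []
    for raw in text.splitlines():
        line = raw.replace("**", "").strip()
        # Skip blank lines and lines that are only a stage direction.
        if not line or re.fullmatch(r"\(.*\)", line):
            continue
        # Drop inline stage directions such as "(Excitedly interrupting)".
        line = re.sub(r"\s*\([^)]*\)\s*", " ", line).strip()
        # Keep only lines that begin with one of the configured speaker labels.
        if line.split(":", 1)[0] in speakers:
            lines.append(line)
    return "\n".join(lines)

sample = (
    "**(Intro music fades)**\n\n"
    "**David:** Welcome back!\n\n"
    "**Jenny:** Thanks, David!\n\n"
    "**David:** (Excitedly interrupting) So, think of those coats!"
)
print(clean_transcript(sample))
```

Matching the speaker labels against the names used in the voice configuration also catches transcripts where the model introduced an unexpected speaker.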

In [None]:
response = client.models.generate_content(
   model="gemini-2.5-flash-preview-tts",
   contents=transcript,
   config=types.GenerateContentConfig(
      response_modalities=["AUDIO"],
      speech_config=types.SpeechConfig(
         multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
            speaker_voice_configs=[
               types.SpeakerVoiceConfig(
                  speaker='Jenny',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Kore',
                     )
                  )
               ),
               types.SpeakerVoiceConfig(
                  speaker='David',
                  voice_config=types.VoiceConfig(
                     prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name='Puck',
                     )
                  )
               ),
            ]
         )
      )
   )
)

data = response.candidates[0].content.parts[0].inline_data.data
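
The nested `SpeakerVoiceConfig` entries above repeat the same structure for each speaker. A sketch that builds the equivalent structure from a speaker-to-voice mapping is below. It uses plain dictionaries with the same field names as the keyword arguments in the cell above, and assumes the SDK coerces dicts into its config types (as it does for `config={'query_base': True}` in the model listing). That coercion for nested speech configs is an assumption worth verifying; `multi_speaker_speech_config` is a hypothetical helper.

```python
def multi_speaker_speech_config(voices):
    """Build a speech_config dict from a {speaker_name: voice_name} mapping."""
    return {
        "multi_speaker_voice_config": {
            "speaker_voice_configs": [
                {
                    "speaker": speaker,
                    "voice_config": {
                        "prebuilt_voice_config": {"voice_name": voice}
                    },
                }
                for speaker, voice in voices.items()
            ]
        }
    }

speech_config = multi_speaker_speech_config({"Jenny": "Kore", "David": "Puck"})
```

The result could then be passed as `speech_config=speech_config` when constructing the generation config. The speaker names must match the labels used in the transcript.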


In [None]:
# set the sample rate
rate = 24000
file_name = 'multi_01.wav'

print(f"\nSaving sample rate: {rate}")
wave_file(file_name, data, rate=rate)


Saving sample rate: 24000

Writing audio file with parameters:
Channels: 1
Sample rate: 24000
Sample width: 2
Data length: 4220686 bytes


In [None]:
display(Audio(file_name))