### Overview

This notebook generates audio ads from an ad script. It translates the input text into each selected target language and then synthesizes speech for every translation.

### Requirements

To run this notebook, you need a Google Cloud Project with the following APIs enabled:

* [Text to Speech](https://cloud.google.com/text-to-speech)
* [Vertex AI](https://cloud.google.com/vertex-ai?hl=en)

You must also have billing enabled for your project.

### How to use this notebook:

1. Run all the installation steps in the given order.
2. Enter the ad text in any language and select the desired output languages.
3. The LLM (Vertex AI) translates the text into each selected language and generates Speech Synthesis Markup Language (SSML), which helps the synthesized speech sound more natural.
4. The notebook returns a list of audio files, which can be downloaded for ad trafficking.
5. Optionally, select a voice other than the default for a given language.
6. Optionally, edit the SSML to adjust how the speech sounds.




Run the cells under "Installation steps" before using the rest of the notebook.

# Installation steps:

In [None]:
#@title Install the Google Cloud libraries to work with
!pip install --upgrade google-cloud-texttospeech
!pip install google-cloud-aiplatform

In [None]:
#@title Import the required modules
from google.cloud import texttospeech
from google.colab import auth
import ipywidgets as widgets
import vertexai
from vertexai.language_models import TextGenerationModel
from IPython.display import Audio

In [None]:
#@title Specify your Google Cloud Project id
project_id='' # @param {type:"string"}
location = 'europe-west4' # @param ["us-west1", "us-west2", "us-west3", "us-west4", "us-central1", "us-east1", "us-east4", "us-south1", "northamerica-northeast1", "northamerica-northeast2", "southamerica-east1", "southamerica-west1", "europe-west2", "europe-west1", "europe-west4", "europe-west6", "europe-west3", "europe-north1", "europe-central2", "europe-west8", "europe-west9", "europe-southwest1", "asia-south1", "asia-southeast1", "asia-southeast2", "asia-east2", "asia-east1", "asia-northeast1", "asia-northeast2", "australia-southeast1", "australia-southeast2", "asia-northeast3", "me-west1"]

auth.authenticate_user(project_id=project_id)

In [None]:
#@title Instantiate the Text to Speech Client
client = texttospeech.TextToSpeechClient()

In [None]:
#@title Retrieve all the voices from the Text To Speech API
def get_voices():
    """Retrieves all the voices from the Text To Speech API.

    Returns:
        A list of tuples (language_code, voice_name), where each tuple
        represents a voice available in the Text To Speech API.
    """
    response = client.list_voices()
    voices = [(voice.language_codes[0], voice.name)
              for voice in response.voices]
    return voices
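The `(language_code, voice_name)` tuples returned by `get_voices()` can be easier to work with once they are grouped by language. The following is a minimal sketch that uses hard-coded sample tuples in place of the API response; `group_voices_by_language` and `sample_voices` are hypothetical names and are not part of the notebook.

```python
from collections import defaultdict


def group_voices_by_language(voices):
    """Group (language_code, voice_name) tuples into {language_code: [voice_name, ...]}."""
    grouped = defaultdict(list)
    for language_code, voice_name in voices:
        grouped[language_code].append(voice_name)
    return dict(grouped)


# Sample data standing in for the output of get_voices() (assumption).
sample_voices = [
    ("en-US", "en-US-Standard-A"),
    ("fr-FR", "fr-FR-Standard-A"),
    ("en-US", "en-US-Wavenet-B"),
]
voices_by_language = group_voices_by_language(sample_voices)
print(voices_by_language["en-US"])
```

Grouping on the exact language code avoids relying on a substring match against the voice name.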

In [None]:
#@title Fetch available languages and set a default voice per language
languages = []
languages_default_voice = dict()
# Uncomment and edit this dict if you want to use your own defaults.
# languages_default_voice = {'af-ZA': 'af-ZA-Standard-A',
#                            'bg-BG': 'bg-BG-Standard-A'}
for voice in get_voices():
    language_pair = voice[0]
    if language_pair not in languages:
        languages.append(language_pair)
        # Keep a custom default if one was set above; otherwise use the
        # first voice listed for this language.
        languages_default_voice.setdefault(language_pair, voice[1])

print(languages_default_voice)

In [None]:
#@title Send prompt to LLM and send the answer SSML to TTS API
synthesize_errors = []
vertexai.init(project=project_id, location=location)

def prompt(
    input_text: str,
    target_language: str
) -> str:
    """Generates SSML with a Large Language Model for the given input text and target language.

    Args:
        input_text (str): The input text to translate and convert to SSML.
        target_language (str): The target language for the translation.

    Returns:
        str: The SSML-formatted text for the given input text and target language.
    """
    parameters = {
        "temperature": 0.0,  # Temperature controls the degree of randomness in token selection.
        "max_output_tokens": 256,  # Token limit determines the maximum amount of text output.
        "top_p": 0.8,  # Tokens are selected from most probable to least until the sum of their probabilities equals the top_p value.
        "top_k": 40,  # A top_k of 1 means the selected token is the most probable among all tokens.
    }

    model = TextGenerationModel.from_pretrained("text-bison@001")
    response = model.predict(
        f"""
        context: you translate an input text for an audio ad into the requested language.
          You will receive an "input_text" and a "target_language". Translate the text into the requested language and return the translated text in SSML format for the Google Cloud Text-to-Speech API.
          Adjust the SSML parameters such as speed and tone to make the audio more realistic, compelling and sales-like.
          In the case of ca-ES translate to Spanish. Never define voice name in the SSML.

        examples:

            input: text: Vertex AI. Innovate faster with enterprise-ready generative AI. Vertex AI offers everything you need to build and use generative AI—from AI solutions, to Search and Conversation, to 100+ foundation models, to a unified AI platform. New customers get $300 in free credits to spend on Vertex AI.
            language: french


            output: <speak>
            <prosody pitch=\"+10%\" rate=\"115%\" >Vertex AI.</prosody>
            <prosody pitch=\"-5%\" rate=\"110%\" >Innovez plus rapidement grâce à une IA générative adaptée aux entreprises. </prosody>
            <prosody pitch=\"+10%\" rate=\"115%\" > Vertex AI offre tout ce dont vous avez besoin pour créer et utiliser l\'IA générative, des solutions d\'IA à la recherche et à la conversation, en passant par plus de 100 modèles de base et une plate-forme d\'IA unifiée.</prosody>
            <prosody pitch=\"-5%\" rate=\"120%\" >Les nouveaux clients bénéficient de 300 $ de crédits gratuits à dépenser sur Vertex AI.</prosody>
            </speak>

            output: <speak>
            <prosody pitch=\"+10%\" rate=\"90%\" volume=\"100%\">
                    <emphasis level=\"strong\">Entdecken</emphasis> Sie das Geheimnis hinter <emphasis level=\"moderate\">Heinekens</emphasis>  erfrischendem <emphasis level=\"strong\">ersten Schluck</emphasis>.
                </prosody>
                <break time=\"1s\"/>
                <prosody pitch=\"0%\" rate=\"95%\" volume=\"100%\">
                    Dazu gehören die berühmte A-Hefe von Heineken, hochwertiges reines Malz, 100 % natürliche Zutaten, ein innovatives horizontales Brauverfahren und die erfahrenen Star-Brauer von Heineken – ein Titel, für den man 10 Jahre braucht.
                </prosody>
            </speak>

            input: text: Vertex AI. Innovate faster with enterprise-ready generative AI. Vertex AI offers everything you need to build and use generative AI—from AI solutions, to Search and Conversation, to 100+ foundation models, to a unified AI platform.
            language: german


            output: <speak>
            <prosody pitch=\"+10%\" rate=\"115%\" >Vertex AI.</prosody>
            <prosody pitch=\"-5%\" rate=\"110%\" >Schnellere Innovationen mit generativer KI für Unternehmen. </prosody>
            <prosody pitch=\"+10%\" rate=\"115%\" >Vertex AI bietet alles, was Sie zum Erstellen und Verwenden von generativer KI benötigen – von KI-Lösungen über die Suche und Unterhaltung bis hin zu über 100 Foundation Models bis hin zu einer einheitlichen KI-Plattform.</prosody>
            <prosody pitch=\"-5%\" rate=\"120%\"> Neukunden erhalten ein Startguthaben von 300 $ für Vertex AI.</prosody>
            </speak>


            input: text: Text-to-Speech AI. Convert text into natural-sounding speech using an API powered by the best of Google’s AI technologies. New customers get $300 in free credits to spend on Text-to-Speech. Improve customer interactions with intelligent, lifelike responses. Engage users with voice user interface in your devices and applications. Personalize your communication based on user preference of voice and language.
            language: spanish


            output: <speak>
            <prosody pitch=\"+5%\" rate=\"120%\" > IA de Text‐to‐Speech. </prosody>
            <prosody pitch=\"-2%\" rate=\"105%\" >Convierte texto en voz que suena natural con una API basada en las mejores tecnologías de IA de Google.</prosody>
            <prosody pitch=\"-5%\" rate=\"120%\"> Los nuevos clientes reciben 300 USD en crédito gratis para utilizarlo en Text‐to‐Speech.</prosody>
            <prosody pitch=\"-5%\" rate=\"110%\"> Mejora las interacciones con los clientes gracias a respuestas naturales e inteligentes. </prosody>
            <prosody pitch=\"0%\" rate=\"105%\"> Capta el interés de los usuarios con una interfaz de voz en tus dispositivos y aplicaciones. </prosody>
            <prosody pitch=\"5%\" rate=\"110%\">Personaliza tus comunicaciones en función de las preferencias de voz e idioma del usuario. </prosody>
            </speak>


        input_text: {input_text}
        target_language: {target_language}
        output:
        """,
        **parameters,
    )
    return response.text

def get_language_voices(language):
  """Retrieves all voices available for the specified language.

  Args:
      language (str): The target language.

  Returns:
      list: A list of tuples (language_code, voice_name), where each tuple
      represents a voice available for the specified language.
  """
  out = []
  for voice in voices:
      if language in voice[1]:
        out.append(voice)
  return out

def get_voice_ssml_pairs(selected_languages, input_text):
  """Generates SSML for each voice in the given selected languages.

  Args:
      selected_languages (list): A list of target languages.
      input_text (str): The input text to translate and convert to SSML.

  Returns:
      list: A list of tuples (voice_name, ssml_text), where each tuple
      represents the SSML-formatted text for the given input text and a voice
      in one of the selected languages.
  """
  output = []
  for selected_language in selected_languages:
    script = prompt(input_text, selected_language)
    for voice in get_language_voices(selected_language):
      output.append((voice[1], script))
  return output

def get_default_voice_ssml_pairs(selected_languages, input_text):
  """Generates SSML for each default voice in the given selected languages.

  Args:
      selected_languages (list): A list of target languages.
      input_text (str): The input text to translate and convert to SSML.

  Returns:
      list: A list of tuples (voice_name, ssml_text), where each tuple
      represents the SSML-formatted text for the given input text and the
      default voice for one of the selected languages.
  """
  output = []
  for selected_language in selected_languages:
    script = prompt(input_text,selected_language)
    output.append((languages_default_voice.get(selected_language), script))
  return output

def synthesize_speech(text, voice_name):
    """Synthesizes speech using the Text-to-Speech API.

    Args:
        text (str): The text to synthesize.
        voice_name (str): The name of the voice to use.

    Returns:
        Audio: An object representing the synthesized speech.
    """
    language_code = '-'.join(voice_name.split('-')[:2])
    synthesis_input = texttospeech.SynthesisInput(ssml=text)
    voice_params = texttospeech.VoiceSelectionParams(
        language_code=language_code,
        name=voice_name
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    try:
      response = client.synthesize_speech(
          input=synthesis_input,
          voice=voice_params,
          audio_config=audio_config
      )
      return Audio(data=response.audio_content, autoplay=False)
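    except Exception as err:
      # Record the failure and return None so the caller can continue
      # with the remaining voices.
      print('Handling run-time error:', err)
      synthesize_errors.append(err)
      return None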
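The LLM may wrap its answer in extra text or return markup with a missing bracket, and the Text-to-Speech API is likely to reject such input. The following is a minimal sketch of a pre-check that extracts the `<speak>` block and confirms it is well-formed XML before it is passed to `synthesize_speech`. The helper name `extract_ssml` is hypothetical, and well-formed XML does not guarantee that the API accepts every tag or attribute.

```python
import re
import xml.etree.ElementTree as ET


def extract_ssml(llm_output: str) -> str:
    """Return the first <speak>...</speak> block, or raise ValueError."""
    match = re.search(r"<speak\b[^>]*>.*?</speak>", llm_output, flags=re.DOTALL)
    if match is None:
        raise ValueError("No <speak>...</speak> block found in the model output.")
    ssml = match.group(0)
    try:
        # Parsing only checks well-formedness, not SSML semantics.
        ET.fromstring(ssml)
    except ET.ParseError as err:
        raise ValueError(f"SSML is not well-formed XML: {err}") from err
    return ssml


# Sample model output with leading text (assumption for illustration).
sample_output = 'Here you go: <speak><prosody rate="110%">Bonjour</prosody></speak>'
clean_ssml = extract_ssml(sample_output)
print(clean_ssml)
```

A caller could catch the `ValueError` and append it to `synthesize_errors` instead of sending the request.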


# How to use it?


In [None]:
#@title Replace the text with your own and select at least one language below
ad_text = widgets.Textarea(
    value='Replace me with your text',
    description='Ad script:',
    disabled=False,
    layout = widgets.Layout(width='1000px', height='100px')
)

items_left = languages[:len(languages)//2]
items_right = languages[len(languages)//2:]

checkboxes_left = [widgets.Checkbox(value=False, description=item) for item in items_left]
checkboxes_right = [widgets.Checkbox(value=False, description=item) for item in items_right]

left_box = widgets.VBox(checkboxes_left)
right_box = widgets.VBox(checkboxes_right)

display(widgets.VBox([ad_text]))
display(widgets.HBox([left_box, right_box]))

In [None]:
#@title Your input is:
selected_languages = []
for checkbox in checkboxes_left + checkboxes_right:
    if checkbox.value:
        selected_languages.append(checkbox.description)

print(selected_languages)
print(ad_text.value)

In [None]:
#@title Speech generated with a default voice and LLM generated SSML:
voices = get_voices()
input_text = ad_text.value
pairs = get_default_voice_ssml_pairs(selected_languages, input_text)

for voice_name_ssml in pairs:
  print(f"Voice: {voice_name_ssml[0]}")
  # print("\n")
  print(f" SSML: {voice_name_ssml[1]}")
  display(synthesize_speech(voice_name_ssml[1], voice_name_ssml[0]))
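To keep the generated ads as files rather than only as inline players, the audio bytes can be written to disk. The following is a minimal sketch, assuming you have the raw MP3 bytes available (for example `response.audio_content` inside `synthesize_speech`); `save_audio` is a hypothetical helper, and the placeholder bytes below are not real audio.

```python
import tempfile
from pathlib import Path


def save_audio(audio_content: bytes, voice_name: str, output_dir: str) -> Path:
    """Write audio bytes to <output_dir>/<voice_name>.mp3 and return the path."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{voice_name}.mp3"
    path.write_bytes(audio_content)
    return path


# Placeholder bytes stand in for real MP3 data from the API (assumption).
saved_path = save_audio(b"fake-mp3-bytes", "en-US-Standard-A", tempfile.mkdtemp())
print(saved_path.name)
```

Naming each file after its voice keeps one file per language and voice combination.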


# Customization: Select the best voice for your SSML

---



In [None]:
#@title Paste one of the generated SSML strings and select a language
from ipywidgets import widgets
ad_text_ssml_custom = widgets.Textarea(
    value='Replace me with your SSML',
    description='SSML:',
    disabled=False,
    layout = widgets.Layout(width='1000px', height='200px')
)
items_left = languages[:len(languages)//2]
items_right = languages[len(languages)//2:]

checkboxes_left = [widgets.Checkbox(value=False, description=item) for item in items_left]
checkboxes_right = [widgets.Checkbox(value=False, description=item) for item in items_right]

left_box = widgets.VBox(checkboxes_left)
right_box = widgets.VBox(checkboxes_right)

display(widgets.VBox([ad_text_ssml_custom]))
display(widgets.HBox([left_box, right_box]))



In [None]:
#@title Your selection is:
selected_languages_ssml = []
for checkbox in checkboxes_left + checkboxes_right:
    if checkbox.value:
        selected_languages_ssml.append(checkbox.description)

print(selected_languages_ssml)
print(ad_text_ssml_custom.value)

In [None]:
#@title Speech generated for the available voices in the selected language
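# get_language_voices() reads the global `voices`, so refresh it here.
voices = get_voices()
# The pasted SSML is used as-is, so no LLM call is needed in this cell.
for selected_language in selected_languages_ssml:
  for voice in get_language_voices(selected_language):
    print(f"Voice: {voice[1]}")
    display(synthesize_speech(ad_text_ssml_custom.value, voice[1]))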

# Customization: Change your SSML for a given voice name to make your ad perfect

In [None]:
#@title Paste one of the generated SSML strings and enter a voice name
from ipywidgets import widgets
ad_text_ssml = widgets.Textarea(
    value='Replace me with your SSML',
    description='SSML string for TTS:',
    disabled=False,
    layout = widgets.Layout(width='1000px', height='200px')
)
ad_text_voice_name = widgets.Textarea(
    value='Replace me with your voice name',
    description='Voice name for TTS:',
    disabled=False,
    layout = widgets.Layout(width='1000px', height='50px')
)
box = widgets.VBox([ad_text_ssml, ad_text_voice_name])
box

In [None]:
#@title Listen to & download the generated speech
synthesize_speech(ad_text_ssml.value, ad_text_voice_name.value)
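Because the voice name is typed into a free-text field, it may carry stray whitespace or still hold the placeholder text. The following is a minimal sketch of a check that could run before `synthesize_speech`; it mirrors how that function derives the language code from the first two hyphen-separated parts of the voice name. `split_voice_name` is a hypothetical helper, not part of the notebook.

```python
def split_voice_name(raw_voice_name: str):
    """Return (language_code, voice_name) from a user-entered voice name."""
    voice_name = raw_voice_name.strip()
    parts = voice_name.split("-")
    # Expect at least language, region, and a voice identifier.
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"Unexpected voice name format: {voice_name!r}")
    language_code = "-".join(parts[:2])
    return language_code, voice_name


# A value as it might come from the text field, with stray whitespace.
language_code, voice_name = split_voice_name("  en-US-Standard-A\n")
print(language_code, voice_name)
```

The check only looks at the shape of the name; whether the voice actually exists is still decided by the API.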

Paste your SSML and get a list of all voices