### Overview

This notebook generates audio ads from an ad script. It translates the input text into each selected target language and then synthesizes speech from the translation.

### Requirements

To run this notebook, you need a Google Cloud Project with the following APIs enabled:

* [Text to Speech](https://cloud.google.com/text-to-speech)
* [Vertex AI](https://cloud.google.com/vertex-ai?hl=en)

You must also have billing enabled for your project.

### How to use this notebook:

1. Run all the installation steps in the given order.
2. Enter the ad text in any language and select the desired output languages.
3. The LLM (a Gemini model on Vertex AI) translates the text into each selected language and returns it as Speech Synthesis Markup Language (SSML), which helps the synthesized ad sound more natural.
4. You receive one audio clip per selected language as output, which can be downloaded for ad trafficking.
5. Optionally, select another voice (other than the default) for a given language.
6. Optionally, edit the SSML so the speech sounds closer to your preference.
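
To make step 3 concrete, here is a minimal sketch of the kind of SSML the notebook works with: each sentence is wrapped in a `<prosody>` tag with its own pitch and rate, inside a single `<speak>` root. The helper `build_ssml` and the sample sentences are hypothetical and only illustrate the format; in the notebook the LLM produces this markup.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(sentences):
    """Wrap (text, pitch, rate) triples in <prosody> tags inside one <speak> root."""
    parts = ["<speak>"]
    for text, pitch, rate in sentences:
        # escape() protects characters such as & and < that would break the XML
        parts.append(
            f"<prosody pitch={quoteattr(pitch)} rate={quoteattr(rate)}>"
            f"{escape(text)}</prosody>"
        )
    parts.append("</speak>")
    return "\n".join(parts)


# Hypothetical sample sentences, not taken from a real ad
ssml = build_ssml([
    ("Vertex AI.", "+10%", "115%"),
    ("Search & Conversation, in one platform.", "-5%", "110%"),
])
print(ssml)
```

Escaping matters when you edit SSML by hand (step 6): a bare `&` in the ad text makes the markup invalid XML.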




Run the following installation cells before generating speech.

# Installation steps:

In [None]:
#@title Install the required Google Cloud libraries
!pip install --upgrade google-cloud-texttospeech
!pip install google-cloud-aiplatform

In [None]:
#@title Import the required modules
from google.cloud import texttospeech
from google.colab import auth
import ipywidgets as widgets
import vertexai
from vertexai.generative_models import GenerativeModel
from IPython.display import Audio, clear_output, display
from numpy import array_split

In [None]:
#@title Specify your Google Cloud project ID
project_id='youtube-channel-classifier' # @param {type:"string"}
location = 'europe-west4' # @param ["us-west1", "us-west2", "us-west3", "us-west4", "us-central1", "us-east1", "us-east4", "us-south1", "northamerica-northeast1", "northamerica-northeast2", "southamerica-east1", "southamerica-west1", "europe-west2", "europe-west1", "europe-west4", "europe-west6", "europe-west3", "europe-north1", "europe-central2", "europe-west8", "europe-west9", "europe-southwest1", "asia-south1", "asia-southeast1", "asia-southeast2", "asia-east2", "asia-east1", "asia-northeast1", "asia-northeast2", "australia-southeast1", "australia-southeast2", "asia-northeast3", "me-west1"]

auth.authenticate_user(project_id=project_id)
vertexai.init(project=project_id, location=location)

In [None]:
# @title Set up script and APIs

# Get text-to-speech client
client = texttospeech.TextToSpeechClient()


def get_voices():
  """Retrieves all the voices available from the Text To Speech API.

  Returns:
      A list of tuples (language_code, voice_name), where each tuple
      represents a voice available in the Text-to-Speech API.
  """

  # Fetch all voices from the Text-to-Speech API
  response = client.list_voices()

  # Extract language code and voice name for each voice
  voices = [(voice.language_codes[0], voice.name) for voice in response.voices]

  return voices


def prompt(input_text: str, target_language: str) -> str:
  """Generates SSML with a Large Language Model for the given input text and target language.

  Args:
      input_text (str): The input text to translate and convert to SSML.
      target_language (str): The target language for the translation.

  Returns:
      str: The SSML-formatted text for the given input text and target language.
  """
  # Generation settings passed to the model
  parameters = {
      # Temperature controls the degree of randomness in token selection.
      "temperature": 0.0,
      # Token limit determines the maximum amount of text output.
      "max_output_tokens": 256,
      # Tokens are selected from most probable to least until the sum of
      # their probabilities equals the top_p value.
      "top_p": 0.8,
      # Sampling considers only the top_k most probable tokens at each step.
      "top_k": 40,
  }
  model = GenerativeModel("gemini-1.0-pro-001")
  responses = model.generate_content(
      [f"""
        context: you translate an input text for an audio ad into the requested language.
          You will receive an "input_text" and a "target_language". You'll translate the text into the requested language and return the translated sentence in SSML format for the GCP Text-to-Speech API.
          Adjust the SSML parameters such as speed and tone to make the audio more realistic, compelling and sales-like.
          In the case of ca-ES translate to Spanish. Never define voice name in the SSML.

        examples:

            input: text: Vertex AI. Innovate faster with enterprise-ready generative AI. Vertex AI offers everything you need to build and use generative AI—from AI solutions, to Search and Conversation, to 100+ foundation models, to a unified AI platform. New customers get $300 in free credits to spend on Vertex AI.
            language: french

            output: <speak>
            <prosody pitch=\"+10%\" rate=\"115%\" >Vertex AI.</prosody>
            <prosody pitch=\"-5%\" rate=\"110%\" >Innovez plus rapidement grâce à une IA générative adaptée aux entreprises. </prosody>
            <prosody pitch=\"+10%\" rate=\"115%\" > Vertex AI offre tout ce dont vous avez besoin pour créer et utiliser l\'IA générative, des solutions d\'IA à la recherche et à la conversation, en passant par plus de 100 modèles de base et une plate-forme d\'IA unifiée.</prosody>
            <prosody pitch=\"-5%\" rate=\"120%\" >Les nouveaux clients bénéficient de 300 $ de crédits gratuits à dépenser sur Vertex AI.</prosody>
            </speak>

            input: text: Vertex AI. Innovate faster with enterprise-ready generative AI. Vertex AI offers everything you need to build and use generative AI—from AI solutions, to Search and Conversation, to 100+ foundation models, to a unified AI platform.
            language: german

            output: <speak>
            <prosody pitch=\"+10%\" rate=\"115%\" >Vertex AI.</prosody>
            <prosody pitch=\"-5%\" rate=\"110%\" >Schnellere Innovationen mit generativer KI für Unternehmen. </prosody>
            <prosody pitch=\"+10%\" rate=\"115%\" >Vertex AI bietet alles, was Sie zum Erstellen und Verwenden von generativer KI benötigen – von KI-Lösungen über die Suche und Unterhaltung bis hin zu über 100 Foundation Models bis hin zu einer einheitlichen KI-Plattform.</prosody>
            <prosody pitch=\"-5%\" rate=\"120%\"> Neukunden erhalten ein Startguthaben von 300 $ für Vertex AI.</prosody>
            </speak>

            input: text: Text-to-Speech AI. Convert text into natural-sounding speech using an API powered by the best of Google’s AI technologies. New customers get $300 in free credits to spend on Text-to-Speech. Improve customer interactions with intelligent, lifelike responses. Engage users with voice user interface in your devices and applications. Personalize your communication based on user preference of voice and language.
            language: spanish

            output: <speak>
            <prosody pitch=\"+5%\" rate=\"120%\" > IA de Text‐to‐Speech. </prosody>
            <prosody pitch=\"-2%\" rate=\"105%\" >Convierte texto en voz que suena natural con una API basada en las mejores tecnologías de IA de Google.</prosody>
            <prosody pitch=\"-5%\" rate=\"120%\"> Los nuevos clientes reciben 300 USD en crédito gratis para utilizarlo en Text‐to‐Speech.</prosody>
            <prosody pitch=\"-5%\" rate=\"110%\"> Mejora las interacciones con los clientes gracias a respuestas naturales e inteligentes. </prosody>
            <prosody pitch=\"+0%\" rate=\"105%\"> Capta el interés de los usuarios con una interfaz de voz en tus dispositivos y aplicaciones. </prosody>
            <prosody pitch=\"+5%\" rate=\"110%\">Personaliza tus comunicaciones en función de las preferencias de voz e idioma del usuario. </prosody>
            </speak>

        input_text: {input_text}
        target_language: {target_language}
        output:
        """],
      generation_config=parameters,
  )
  return responses.candidates[0].content.parts[0].text


def get_language_voices(language):
  """Retrieves all voices available for the specified language.

  Args:
      language (str): The target language.

  Returns:
      list: A list of tuples (language_code, voice_name), where each tuple
      represents a voice available for the specified language.
  """
  # Match on the voice's language code rather than a substring of its name
  return [voice for voice in get_voices() if voice[0] == language]


def get_voice_ssml_pairs(selected_languages, input_text):
  """Generates SSML for each voice in the given selected languages.

  Args:
      selected_languages (list): A list of target languages.
      input_text (str): The input text to translate and convert to SSML.

  Returns:
      list: A list of tuples (voice_name, ssml_text), where each tuple
      represents the SSML-formatted text for the given input text and a voice
      in one of the selected languages.
  """
  output = []
  for selected_language in selected_languages:
    script = prompt(input_text, selected_language)
    for voice in get_language_voices(selected_language):
      output.append((voice[1], script))
  return output


def get_default_voice_ssml_pairs(selected_languages, input_text):
  """Generates SSML for each default voice in the given selected languages.

  Args:
      selected_languages (list): A list of target languages.
      input_text (str): The input text to translate and convert to SSML.

  Returns:
      list: A list of tuples (voice_name, ssml_text), where each tuple
      represents the SSML-formatted text for the given input text and the
      default voice for one of the selected languages.
  """
  output = []
  for selected_language in selected_languages:
    script = prompt(input_text, selected_language)
    output.append((languages_default_voice.get(selected_language), script))
  return output


def synthesize_speech(text, voice_name):
  """Synthesizes speech using the Text-to-Speech API.

  Args:
      text (str): The SSML markup to synthesize.
      voice_name (str): The name of the voice to use.

  Returns:
      Audio: An object representing the synthesized speech, or None if
      synthesis fails.
  """
  language_code = "-".join(voice_name.split("-")[:2])
  synthesis_input = texttospeech.SynthesisInput(ssml=text)
  voice_params = texttospeech.VoiceSelectionParams(
      language_code=language_code, name=voice_name
  )
  audio_config = texttospeech.AudioConfig(
      audio_encoding=texttospeech.AudioEncoding.MP3
  )
  try:
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice_params, audio_config=audio_config
    )
    return Audio(data=response.audio_content, autoplay=False)
  except Exception as err:
    # Record the failure and return None so the caller can continue
    print("Handling run-time error:", err)
    synthesize_errors.append(err)
    return None


# Initialize the language list and the default-voice lookup
languages = []
languages_default_voice = dict()

# Iterate over all available voices
for voice in get_voices():
  # Extract the language code from the voice information
  language_code = voice[0]

  # Add language to list if not already present
  if language_code not in languages:
    languages.append(language_code)

    # Use the first voice listed for the language as its default
    languages_default_voice[language_code] = voice[1]

# Errors raised during speech synthesis are collected here
synthesize_errors = []
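
The default-voice selection at the end of the cell above keeps the first voice listed for each language code. The sketch below shows the same idea on a small, hypothetical list of `(language_code, voice_name)` tuples standing in for the output of `get_voices()`, so it runs without calling the API. It also shows how `synthesize_speech` derives the language code from a voice name; `language_code_of` is an illustrative helper, not part of the notebook.

```python
# Hypothetical stand-in for the (language_code, voice_name) tuples that
# get_voices() returns; the real list comes from the Text-to-Speech API.
sample_voices = [
    ("en-US", "en-US-Standard-A"),
    ("en-US", "en-US-Wavenet-B"),
    ("fr-FR", "fr-FR-Standard-A"),
    ("cmn-CN", "cmn-CN-Standard-A"),
]

default_voice = {}
for language_code, voice_name in sample_voices:
    # setdefault keeps the first voice seen for each language code
    default_voice.setdefault(language_code, voice_name)

# dicts preserve insertion order, so the keys double as the language list
language_list = list(default_voice)


def language_code_of(voice_name):
    """Derive the language code from the first two dash-separated fields."""
    return "-".join(voice_name.split("-")[:2])


print(default_voice)
print(language_list)
print(language_code_of("cmn-CN-Standard-A"))
```

Because the default is simply the first voice the API happens to list, it is worth checking the result for the languages you care about and picking another voice in the last cell of this notebook if needed.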


# Speech Generation


In [None]:
#@title Generate speech from input text and language selection

class SpeechGenerator:
    """
    Generates speech from input text and selected languages.

    Args:
        languages (list): A list of available languages.
        input_value (str): The initial input text.
        input_description (str): Description for the input text field.

    Attributes:
        languages (list): List of available languages.
        checkboxes (list): List of `ipywidgets.Checkbox` objects for language selection.
        input_description (str): Description for the input text field.
        input_value (str): Initial input text.
        layout (ipywidgets.VBox): Combined layout of UI elements.

    Methods:
        _setup_widgets(): Creates and configures UI elements.
        _setup_layout(): Arranges UI elements into a layout.
        _setup_callbacks(): Defines event handlers for UI interactions.
        on_generate_click(button): Handles button click event to generate speech.
        display(): Displays the UI widget.
    """

    def __init__(self, languages, input_value, input_description):
        self.languages = languages
        self.checkboxes = []
        self.input_description = input_description
        self.input_value = input_value

        self._setup_widgets()
        self._setup_layout()
        self._setup_callbacks()

    def _setup_widgets(self):
        # Create input text field
        self.ad_text = widgets.Textarea(
            value=self.input_value,
            description=self.input_description,
            layout=widgets.Layout(width='1000px', height='100px')
        )

        # Create a checkbox for each available language
        for language in self.languages:
            checkbox = widgets.Checkbox(value=False, description=language)
            self.checkboxes.append(checkbox)

        # Create and configure button for generating speech
        self.generate_button = widgets.Button(description="Generate", layout=widgets.Layout(width='200px'))

    def _setup_layout(self):
        # Divide checkboxes into at most 4 columns. Ceiling division avoids a
        # zero step (fewer than 4 checkboxes) and a stray fifth column.
        chunk_size = max(1, -(-len(self.checkboxes) // 4))
        boxes = [widgets.VBox(self.checkboxes[i:i + chunk_size]) for i in range(0, len(self.checkboxes), chunk_size)]

        # Combine UI elements into a single layout
        self.layout = widgets.VBox([
            self.ad_text,
            widgets.HBox(boxes),
            self.generate_button,
        ])

    def _setup_callbacks(self):
        # Register callback for button click event
        self.generate_button.on_click(self.on_generate_click)

    def on_generate_click(self, button):
        # Update selected languages based on checkbox states
        self.selected_languages = [checkbox.description for checkbox in self.checkboxes if checkbox.value]
        if not self.selected_languages:
            print("Please select at least one language")
            return

        # Generate and display speech for each selected language
        print(f'Generating speech for: {", ".join(self.selected_languages)}')
        pairs = get_default_voice_ssml_pairs(self.selected_languages, self.ad_text.value)

        for voice_name, ssml in pairs:
            print(f"Voice: {voice_name}")
            print(f"SSML: {ssml}")
            display(synthesize_speech(ssml, voice_name))
            print("\n")

    def display(self):
        # Display the combined UI layout
        return display(self.layout)

# Example usage
SpeechGeneratorText = SpeechGenerator(languages, 'Replace me with your text', 'Script:')
SpeechGeneratorText.display()
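
The cell above passes the model's response straight to the Text-to-Speech API. One assumption worth guarding against is that the model may wrap its SSML in a Markdown code fence, which is not valid SSML. The sketch below is a hypothetical clean-up helper (`strip_code_fence` is not part of the notebook) that could be applied to the model's text before synthesis.

```python
import re

# Three backticks, built programmatically so this example stays easy to embed
FENCE = "`" * 3
FENCE_PATTERN = re.compile(FENCE + r"[a-zA-Z]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def strip_code_fence(model_text):
    """Return the text inside a Markdown code fence, or the trimmed text if none."""
    match = FENCE_PATTERN.search(model_text)
    if match:
        return match.group(1).strip()
    return model_text.strip()


wrapped = FENCE + "xml\n<speak>Hello</speak>\n" + FENCE
print(strip_code_fence(wrapped))
print(strip_code_fence("  <speak>Hello</speak>\n"))
```

Text without a fence is returned unchanged apart from trimmed whitespace, so the helper is safe to apply to every response.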


In [None]:
#@title Generate speech and edit input SSML

SpeechGeneratorSSML = SpeechGenerator(languages, 'Replace me with your SSML', 'SSML:')
SpeechGeneratorSSML.display()
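
The generated clips are shown as inline audio players. To keep them as files for ad trafficking, the MP3 bytes can be written to disk. The sketch below is one possible way to do that; `save_audio` is a hypothetical helper, and the placeholder bytes stand in for the `audio_content` field of a real synthesis response.

```python
from pathlib import Path
import tempfile


def save_audio(audio_bytes, voice_name, out_dir):
    """Write MP3 bytes to <out_dir>/<voice_name>.mp3 and return the path."""
    out_path = Path(out_dir) / f"{voice_name}.mp3"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(audio_bytes)
    return out_path


# Placeholder bytes stand in for response.audio_content from the API
fake_audio = b"placeholder bytes, not real MP3 data"
out_dir = tempfile.mkdtemp()
saved = save_audio(fake_audio, "fr-FR-Standard-A", out_dir)
print(saved.name, saved.stat().st_size)
```

In Colab, a saved file can then be offered for download with `google.colab.files.download(str(saved))`. Naming the file after the voice keeps one file per language and voice combination.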

In [None]:
#@title Select different voices and optimize SSML

from ipywidgets import Dropdown

def update_second_dropdown(change):
  """
  Updates the options of the second dropdown based on the selected value in the first dropdown.

  Args:
      change: A dictionary containing information about the change event, including the new value of the first dropdown.
  """
  selected_language = change["new"]  # Language selected in the first dropdown
  # Keep only the voice names from the (language_code, voice_name) tuples
  new_options = [voice_name for _, voice_name in get_language_voices(selected_language)]
  second_dropdown.options = new_options  # Update the second dropdown's options

ad_text = widgets.Textarea(
    value='Your SSML',
    description='SSML:',
    disabled=False,
    layout=widgets.Layout(width='1000px', height='100px')
)

def on_generate_click(button):
  """Synthesizes the SSML in the text area with the selected voice."""
  print(ad_text.value)
  print(second_dropdown.value)
  display(synthesize_speech(ad_text.value, second_dropdown.value))


# Define the first and second dropdowns
first_dropdown = Dropdown(options=languages, value=None, description='Select language')
second_dropdown = Dropdown(options=[], description='Select voice')  # Initially empty

# Connect the first dropdown to the update function
first_dropdown.observe(update_second_dropdown, names="value")

# Create the button that triggers synthesis
generate_button = widgets.Button(description="Generate Voice", layout=widgets.Layout(width='200px'))
generate_button.on_click(on_generate_click)

# Display the text area, both dropdowns, and the button
display(ad_text)
display(first_dropdown)
display(second_dropdown)
display(generate_button)
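
Since the cell above sends hand-edited SSML directly to the API, a quick local check can catch simple mistakes such as an unclosed tag before a request is made. The sketch below uses only the standard library; `check_ssml` is a hypothetical helper. It verifies that the text is well-formed XML with a `<speak>` root, and it does not check whether the API supports every tag or attribute used.

```python
import xml.etree.ElementTree as ET


def check_ssml(ssml_text):
    """Return (True, "") for well-formed XML rooted at <speak>, else (False, reason)."""
    try:
        root = ET.fromstring(ssml_text)
    except ET.ParseError as err:
        return False, f"not well-formed XML: {err}"
    if root.tag != "speak":
        return False, f"root element is <{root.tag}>, expected <speak>"
    return True, ""


good = '<speak><prosody rate="110%">Hello</prosody></speak>'
broken = '<speak><prosody rate="110%" Hello</prosody></speak>'  # missing ">"
print(check_ssml(good))
print(check_ssml(broken)[0])
```

A check like this could run at the top of `on_generate_click`, printing the reason and returning early when the SSML is not well-formed.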

