# Converting a Book's Text Into an Audiobook With N Voice Actors

I want to accomplish the tasks below using the [Gemini 1.5](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/overview) model. The approach is inspired by my [chat sessions with ChatGPT and Claude](https://claude.ai/chat/bbc18d5d-7de9-4d69-ab6d-67c38734e6d6), by GCP's $150 credits, and by [BookNLP's steps](https://github.com/booknlp/booknlp/blob/main/README.md). I turned to an LLM after finding these tasks very hard to do correctly with traditional NLP tools like spaCy.

- Character identification and name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER)
- Referential gender inference (TOM_SAWYER -> he/him/his)
- Dialogue attribution or quotation speaker identification with coreference resolution

I will also use [GCP's TTS API](https://cloud.google.com/text-to-speech/docs/basics) to produce the audio for each dialogue by converting Speech Synthesis Markup Language (SSML) input into audio data, and finally stitch the clips together into one audiobook.

- Note: ssml must be well-formed according to: https://www.w3.org/TR/speech-synthesis/
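
A quick way to catch malformed markup before it reaches the TTS API is to parse it as XML. The sketch below uses only the standard library; the helper name `is_well_formed_ssml` is hypothetical, and it checks XML well-formedness plus a `<speak>` root, not full conformance to the SSML specification.

```python
import xml.etree.ElementTree as ET

def is_well_formed_ssml(ssml: str) -> bool:
    """Return True if `ssml` parses as XML and its root element is <speak>."""
    try:
        root = ET.fromstring(ssml.strip())
    except ET.ParseError:
        return False
    return root.tag == "speak"

# A self-closing <break/> is fine; an unclosed <break> is not.
print(is_well_formed_ssml('<speak>Hello <break time="200ms"/> world</speak>'))
print(is_well_formed_ssml('<speak>Hello <break time="200ms"> world</speak>'))
```

Two `<speak>` elements concatenated in one string also fail this check, because an XML document can have only one root element.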

In [None]:
! pip install --upgrade --quiet google-cloud-aiplatform google-cloud-texttospeech pydub

In [None]:
# Restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [None]:
PROJECT_ID = "audiobook-experiment-llm"  # @param {type:"string"}
PROJECT_LOCATION = "asia-south1"  # @param {type:"string"}

In [None]:
# Additional authentication is required for Google Colab
import sys

if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    # PROJECT_ID (set in the cell above) must be your actual GCP project ID.
    auth.authenticate_user(project_id=PROJECT_ID)
    print("Authenticated")

Authenticated


In [None]:
# testing API access to the generative AI model

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project=PROJECT_ID, location=PROJECT_LOCATION)

old_days_gpt_system = """You are not a helpful assistant but a glorious autocomplete machine trained for helping in writing.
Continue writing from where user input ends"""
model = GenerativeModel(model_name="gemini-1.5-flash-001", system_instruction=old_days_gpt_system)

# prompt = """You are an expert theologist on India-born religions.
# What is the origin story of the God Shiva? Please keep your answer short."""
prompt = "Sachin Tendulkar is a very famous comedian from Pakistan who is known for his "

# Prompt tokens count
count_input_tokens = model.count_tokens(prompt)
print(f"Input Prompt Token Count: {count_input_tokens.total_tokens}")
print(f"Input Prompt Character Count: {count_input_tokens.total_billable_characters}")

# Send text to Gemini
response = model.generate_content(prompt)

# Response tokens count
response_usage_metadata = response.usage_metadata
print(f"Output Prompt Token Count: {response_usage_metadata.prompt_token_count}")
print(f"Output Candidates Token Count: {response_usage_metadata.candidates_token_count}")
print(f"Output Total Token Count: {response_usage_metadata.total_token_count}")

print(f"\n\nOutput Response: {response.text}")

Input Prompt Token Count: 18
Input Prompt Character Count: 64
Output Prompt Token Count: 44
Output Candidates Token Count: 59
Output Total Token Count: 103


Output Response: ...hilarious impersonations of politicians and his quick wit. His stand-up routines are always packed, and his YouTube channel has millions of subscribers. He is particularly known for his impersonation of the Prime Minister, which is so spot-on that it often leaves audiences in stitches. 



## Character Identification & Name Clustering

Our goal is to go beyond listing every person name mentioned in the input text: we want to identify all the acting characters in the story while accounting for the different names used for the same character.

In [None]:
prompt = """
<story>
MR. BENNET was among the earliest of those who waited on Mr. Bingley. He had always intended to visit him, though to the last always assuring his wife that he should not go; and till the evening after the visit was paid she had no knowledge of it. It was then disclosed in the following manner. Observing his second daughter employed in trimming a hat, he suddenly addressed her with,—
“I hope Mr. Bingley will like it, Lizzy.”
“We are not in a way to know what Mr. Bingley likes,” said her mother, resentfully, “since we are not to visit.”
“But you forget, mamma,” said Elizabeth, “that we shall meet him at the assemblies, and that Mrs. Long has promised to introduce him.”
“I do not believe Mrs. Long will do any such thing. She has two nieces of her own. She is a selfish, hypocritical woman, and I have no opinion of her.”
“No more have I,” said Mr. Bennet; “and I am glad to find that you do not depend on her serving you.”
Mrs. Bennet deigned not to make any reply; but, unable to contain herself, began scolding one of her daughters.
“Don’t keep coughing so, Kitty, for heaven’s sake! Have a little compassion on my nerves. You tear them to pieces.”
“Kitty has no discretion in her coughs,” said her father; “she times them ill.”
“I do not cough for my own amusement,” replied Kitty, fretfully. “When is your next ball to be, Lizzy?”
“To-morrow fortnight.”
“Ay, so it is,” cried her mother, “and Mrs. Long does not come back till the day before; so, it will be impossible for her to introduce him, for she will not know him herself.”
</story>
"""

In [None]:
# system_instruction = """
# You are an expert editor at an audiobook recording agency who is also an entity extraction and document labeling specialist.
# Your job is to identify all the acting characters in a story who should be given a voice in the recording phase later.
# You should then proceed to extract narration and dialogue content as speech-ready continuous text and attribute them to their respective speaker (i.e. speaking actor).

# Given a text segment of a short or long story or book, your task is to extract the text value of the following entities:
# --- List Of Acting Characters ---
# [
#   {
#     id: "",
#     name: "",
#     role: "narrator" | "actor" | "narrator and actor",
#     gender: "male" | "female",
#   }
# ]

# --- Story's Dialogues With Labels ---
# [
#   {
#     acting_character: {
#       id: "",
#       name: ""
#     },
#     dialogue: ""
#   }
# ]

# Please adhere to the following rules:
# - The JSON schema must be followed during the extraction.
# - The values must only include text found in the document.
# - Do not normalize any entity value.
# - If an entity is not found in the document, set the entity value to null.
# - Attribute a text segment to "narrator" role if it can not be attributed to any specific actor.
# - If a character resolves to another one in the story, update it
# - If a person's name is mentioned but they don't have any dialogue of their own then they should not be added into the list of acting characters.
# - Set the id field of an acting character to a random value and use it to uniquely identify characters even if two or more characters are found to have the same names.
# - If an acting character is identified already in the given input list of `identified_acting_characters` then don't repeat it.
# - Make sure that you don't repeat the same character twice by doing character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER).
# """

In [None]:
# Prompt engineering strategies that clearly improved the output
# 0. Start tinkering with prompts in the studio first
# 1. https://platform.openai.com/docs/guides/prompt-engineering/tactic-ask-the-model-to-adopt-a-persona
# 2. https://platform.openai.com/docs/guides/prompt-engineering/tactic-use-delimiters-to-clearly-indicate-distinct-parts-of-the-input *
#     --> character identification became reliably good after using clear delimiters

# On this system prompt, GPT-4o is slightly more accurate/reliable than GPT-3.5

# Gemini-1.5-flash fails as it can't infer gender field for any character;
# it also fails at inferring that "Lizzy" & "Elizabeth" are alt names of the same character;
# and it also identifies "Mr Bingley" and "Mrs Long" as actors who are only mentioned in dialogues and don't have their own dialogues

system_instruction_for_actor_identification = """
You are an expert editor at an audiobook recording agency who also handles the document labelling task to convert a book or story into script for voice actors per agency's rule book (delimited with <rules> XML tags).
Given a text segment of a story or book (delimited with <story> XML tags), your job is to identify if the story has a narrator and to extract all the acting characters (that is, characters with at least one dialogue), so that they can be given a unique voice in the recording phase later.

You must adhere to the following schema in sharing your output:
<output>
--- List of Characters for Voice Actors ---
[
  {
    id: "",
    gender: "male" | "female",
    primary_name: "",
    alt_names: [
      ""
    ],
  }
]
--- Story's narrator ---
story_narrator: {
  present: "true" | "false",
  is_actor_in_story: "actor_id"
}
</output>

<rules>
Please adhere to the following rules:
- The values must only include text found in the document. If an entity is not found in the document, set the entity value to null.
- Do not normalize any entity value.
- Attribute a text segment to "narrator" role if it can not be attributed to any specific actor and carefully notice if the narrator is an actor in the story or not.
- If a person's name is mentioned but they don't have any dialogue of their own then they should not be added into the list of acting characters.
- Set the `id` field of an acting character to a random value and use it to uniquely identify characters even if two or more characters are found to have the same names.
- Make sure that you don't repeat the same character twice by doing character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> Thomas Sawyer).
</rules>
"""

In [None]:
# Create the model

# See https://ai.google.dev/api/python/google/generativeai/GenerativeModel
generation_config = {
  "temperature": 0,
}

model = GenerativeModel(
    model_name="gemini-1.5-pro",
    generation_config=generation_config,
    system_instruction=system_instruction_for_actor_identification
  )

model

<vertexai.generative_models.GenerativeModel at 0x7f0517f9bb50>

In [None]:
# Send text to Gemini
response = model.generate_content(prompt)

# Response tokens count
response_usage_metadata = response.usage_metadata
print(f"Output Prompt Token Count: {response_usage_metadata.prompt_token_count}")
print(f"Output Candidates Token Count: {response_usage_metadata.candidates_token_count}")
print(f"Output Total Token Count: {response_usage_metadata.total_token_count}")

print(f"\n\n{response.text}")

Output Prompt Token Count: 831
Output Candidates Token Count: 366
Output Total Token Count: 1197


<output>
--- List of Characters for Voice Actors ---
[
  {
    id: "2c42a65f-4820-409f-a4d3-9094a002183f",
    gender: "male",
    primary_name: "Mr. Bennet",
    alt_names: [
      "Bennet"
    ],
  },
  {
    id: "9083653b-959a-4834-a019-a4a0e571a18f",
    gender: "female",
    primary_name: "Lizzy",
    alt_names: [
      "Elizabeth"
    ],
  },
  {
    id: "e298e278-d7c5-4dff-a499-d2a02a500b7f",
    gender: "female",
    primary_name: "Mrs. Bennet",
    alt_names: [
      "mamma"
    ],
  },
  {
    id: "80214214-9a8a-483f-a49a-82a09a500b8f",
    gender: "female",
    primary_name: "Kitty",
    alt_names: [],
  }
]
--- Story's narrator ---
story_narrator: {
  present: "true",
  is_actor_in_story: null
}
</output>


## Dialogue Attribution

Our goal is to accurately segment the text into spoken lines and associate each line with the correct character or speaker.

This can be a crucial step in tasks like dialogue summarization, character analysis, and story understanding in NLP pipelines for literary works or scriptwriting.
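
For a sense of what the segmentation half of the task involves, here is a rule-based baseline, not the approach used in this notebook. It splits a paragraph into quoted spans and the narration between them, assuming curly double quotes as in the sample story. The names `QUOTE_PATTERN` and `split_quotes_and_narration` are hypothetical. It finds where speech is, but not who is speaking, which is the part delegated to the LLM below.

```python
import re

# Matches one span enclosed in curly double quotes, as used in the sample story.
QUOTE_PATTERN = re.compile(r"“[^”]*”")

def split_quotes_and_narration(paragraph: str):
    """Split a paragraph into ("quote" | "narration", text) segments, in order."""
    segments = []
    cursor = 0
    for match in QUOTE_PATTERN.finditer(paragraph):
        narration = paragraph[cursor:match.start()].strip()
        if narration:
            segments.append(("narration", narration))
        segments.append(("quote", match.group(0)))
        cursor = match.end()
    tail = paragraph[cursor:].strip()
    if tail:
        segments.append(("narration", tail))
    return segments

line = "“No more have I,” said Mr. Bennet; “and I am glad to find that you do not depend on her serving you.”"
for kind, text in split_quotes_and_narration(line):
    print(kind, "->", text)
```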

In [None]:
# # Prompt engineering strategies that clearly improved the output
# # 0. Start tinkering with prompts in the studio first
# # 1. https://platform.openai.com/docs/guides/prompt-engineering/tactic-ask-the-model-to-adopt-a-persona
# # 2. https://platform.openai.com/docs/guides/prompt-engineering/tactic-use-delimiters-to-clearly-indicate-distinct-parts-of-the-input *
# #     --> character identification became reliably good after using clear delimiters
# # 3. https://platform.openai.com/docs/guides/prompt-engineering/tactic-specify-the-steps-required-to-complete-a-task **
# #     --> Separated the sub-task of dialogue attribution into two distinct sub-tasks (2nd and 3rd)

# # On this system prompt, GPT-4o works nicely with temperature set to 0.2 instead of being set to 0
# # But Gemini-1.5-pro fails with the same system prompt in removing dialogue tags from SSML markup content.

# openai_system_instruction_for_dialogue_attr = """
# You are an expert editor at an audiobook recording agency who also handles the document labelling task to convert text from a book or story into script for a cast of voice actors adhering to your agency's rule book (delimited with <rules> XML tags).

# Given a text segment of a story or book (delimited with <story> XML tags), your job involves the following tasks:

#   1. You should infer the cast so that they can be given a unique voice actor in the recording phase later by
#     a. extracting all the acting characters (that is, characters with at least one dialogue) and their genders (delimited in output with <cast-for-voice-actors> tags)
#     b. identifying if the story has a narrator and if an acting character itself is that narrator too (delimited in output with <story-narrator> tags)

#   2. You should prepare story scripts to be assigned to corresponding voice actors by extracting dialogues exactly as written by the story author and correctly attributing them to their respective acting characters or the narrator with original text preserved including the dialogue tags meant for reading such as "he said", "she shouted in fear", etc (delimited in output with <script-for-voice-actors> tags and saved under `dialogue_content.story_author_text` field)

#   3. You should prepare audio samples for remote rehearsals of all the voice actors by labelling dialogues in the "Speech Synthesis Markup Language" (SSML) format which can later be given to a "Text To Speech" service. In doing this, you should use context and dialogue tags to infer the emotion and subtext in each spoken line or phrase (separated by dialogue tags or pauses) from the `dialogue_content.story_author_text` field and finally transform that line or phrase for speech synthesis by using SSML tags like <speech>, <break>, <say-as>, <emphasis> and <prosody> with appropriate attributes for rate, pitch, etc (delimited in output with <script-for-voice-actors> tags and saved under `dialogue_content. text_to_speech_markup` field)

# You must adhere to the following schema in sharing your output:
# <output>
# <cast-for-voice-actors>
# [
#   {
#     id: "",
#     gender: "male" | "female",
#     primary_name: "",
#     alt_names: null | [
#       ""
#     ],
#   }
# ]
# </cast-for-voice-actors>
# <story-narrator>
# story_narrator: {
#   present: "true" | "false",
#   is_actor_in_story: "actor_id"
# }
# </story-narrator>
# <script-for-voice-actors>
# [
#   {
#     actor: "narrator" | {
#       id: "",
#       primary_name: "",
#     },
#     dialogue_content: {
#       story_author_text: "",
#       text_to_speech_markup: ""
#     }
#   }
# ]
# </script-for-voice-actors>
# </output>

# <rules>
# Please adhere to the following rules:
# - The values must only include text found in the document. If an entity is not found in the document, set the entity value to null. And, do not normalize any entity value.
# - Attribute a text segment to "narrator" role if it can not be attributed to any specific actor and carefully notice if the narrator is an actor in the story or not.
# - If a person's name is mentioned but they don't have any dialogue of their own then they should not be added into the list of acting characters.
# - Set the `id` field of an acting character to a random value and use it to uniquely identify characters even if two or more characters are found to have the same names.
# - Make sure that you don't repeat the same character twice by doing character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> Thomas Sawyer).
# </rules>
# """

In [None]:
# Prompt engineering strategies that clearly improved the output
# 0. Start tinkering with prompts in the studio first
# 1. https://platform.openai.com/docs/guides/prompt-engineering/tactic-ask-the-model-to-adopt-a-persona
# 2. https://platform.openai.com/docs/guides/prompt-engineering/tactic-use-delimiters-to-clearly-indicate-distinct-parts-of-the-input *
#     --> character identification became reliably good after using clear delimiters
# 3. https://platform.openai.com/docs/guides/prompt-engineering/tactic-specify-the-steps-required-to-complete-a-task **
#     --> Separated the sub-task of dialogue attribution into two distinct sub-tasks (2nd and 3rd)
# 4. https://platform.openai.com/docs/guides/prompt-engineering/tactic-provide-examples
#     --> This worked only after I added that example under the 3rd instruction step
# 5. Added double quotes around every key of the JSON schema, without which JSON parsing fails downstream

# Gemini needs more work on prompt "clarity" than GPT-4o -- see the 3rd instruction

gemini_system_instruction_for_dialogue_attr = """
You are an expert editor at an audiobook recording agency who also handles the document labelling task to convert text from a book or story into script for a cast of voice actors adhering to your agency's rule book (delimited with <rules> XML tags).

Given a text segment of a story or book (delimited with <story> XML tags), your job involves the following tasks:

  1. You should infer story's cast so that they can be given a unique voice actor in the recording phase later by
    a. extracting all the acting characters (that is, characters with at least one dialogue) and their genders (delimited in output with <cast-for-voice-actors> tags)
    b. identifying if the story has a narrator and if an acting character itself is that narrator too (delimited in output with <story-narrator> tags)

  2. You should extract dialogues and correctly attribute them to their respective acting characters or the narrator with original text preserved including the dialogue tags meant for reading such as "he said", "she shouted in fear", etc. You should attribute a text segment to "narrator" role only if it can not be attributed to any specific actor and carefully notice if the narrator is an actor in the story or not. (delimited in output with <script-for-voice-actors> tags and saved under `dialogue_content.story_author_text` field)

  3. You should take the dialogue in plaintext from the `dialogue_content.story_author_text` field and convert it into Speech Synthesis Markup Language (SSML) format suitable for Google Cloud's "Text To Speech" service. Wrap each logical phrase of the spoken dialogue in <speak></speak> tags. Within <speak></speak>, you should infer the emotion and subtext from dialogue tags and context to transform that for speech synthesis by using SSML tags like <break>, <say-as>, <emphasis> and <prosody> with appropriate attributes for rate, pitch, etc. (delimited in output with <script-for-voice-actors> tags and saved under `dialogue_content. text_to_speech_markup` field)

    - For instance:
      - Input: "I feel lost," Mark said, his voice trembling, "but I can't go back now."
      - Output: <speak><prosody rate="85%">I feel lost,</prosody></speak><speak><break time="200ms"/><prosody rate="90%" pitch="+5%">but I can't go back now.</prosody></speak>

  4. You should take the dialogue in plaintext from the `dialogue_content.story_author_text` field and remove dialogue tags separating different logical phrases of the spoken dialogue. (delimited in output with <script-for-voice-actors> tags and saved under `dialogue_content.actor_speech_text` field)
    - For instance:
      - Input: "I feel lost," Mark said, his voice trembling, "but I can't go back now."
      - Output: "I feel lost, but I can't go back now."

You must adhere to the following schema in sharing your output:
<output>
<cast-for-voice-actors>
[
  {
    "id": "",
    "gender": "male" | "female",
    "primary_name": "",
    "alt_names": null | [
      ""
    ],
  }
]
</cast-for-voice-actors>
<story-narrator>
{
  "present": "true" | "false",
  "is_actor_in_story": null | actor_id
}
</story-narrator>
<script-for-voice-actors>
[
  {
    "actor": "narrator" | {
      "id": "",
      "primary_name": "",
    },
    "dialogue_content": {
      "story_author_text": "",
      "actor_speech_text": "",
      "text_to_speech_markup": ""
    }
  }
]
</script-for-voice-actors>
</output>

<rules>
Please adhere to the following rules:
- The values must only include text found in the document. If an entity is not found in the document, set the entity value to null. And, do not normalize any entity value.
- If a person's name is mentioned but they don't have any dialogue of their own then they should not be added into the list of acting characters.
- Set the `id` field of an acting character to a random value and use it to uniquely identify characters even if two or more characters are found to have the same names.
- Make sure that you don't repeat the same character twice by doing character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> Thomas Sawyer).
</rules>
"""

In [None]:
# Create the model

# See https://ai.google.dev/api/python/google/generativeai/GenerativeModel
generation_config = {
  "temperature": 0,
}

model = GenerativeModel(
    model_name="gemini-1.5-pro",
    generation_config=generation_config,
    system_instruction=gemini_system_instruction_for_dialogue_attr
  )

model

<vertexai.generative_models.GenerativeModel at 0x7f04fd86c1c0>

In [None]:
# Send text to Gemini
response = model.generate_content(prompt)

# Response tokens count
response_usage_metadata = response.usage_metadata
print(f"Output Prompt Token Count: {response_usage_metadata.prompt_token_count}")
print(f"Output Candidates Token Count: {response_usage_metadata.candidates_token_count}")
print(f"Output Total Token Count: {response_usage_metadata.total_token_count}")

print(f"\n\n{response.text}")

Output Prompt Token Count: 1478
Output Candidates Token Count: 2821
Output Total Token Count: 4299


<output>
<cast-for-voice-actors>
[
  {
    "id": "bd81a33f-30e9-4cf0-a6a1-81580e70eb20",
    "gender": "male",
    "primary_name": "MR. BENNET",
    "alt_names": [
      "Mr. Bennet"
    ]
  },
  {
    "id": "8f2a7595-fa5f-4a71-925f-610f54967c35",
    "gender": "female",
    "primary_name": "Lizzy",
    "alt_names": [
      "Elizabeth"
    ]
  },
  {
    "id": "e5a961a8-712f-4099-a76b-222a28e3087b",
    "gender": "female",
    "primary_name": "Mrs. Bennet",
    "alt_names": null
  },
  {
    "id": "c27e3fb6-c00f-489c-9514-a49a092a859f",
    "gender": "female",
    "primary_name": "Kitty",
    "alt_names": null
  }
]
</cast-for-voice-actors>
<story-narrator>
{
  "present": "true",
  "is_actor_in_story": null
}
</story-narrator>
<script-for-voice-actors>
[
  {
    "actor": "narrator",
    "dialogue_content": {
      "story_author_text": "MR. BENNET was among the earliest of those who wait

In [None]:
import re
import json

def extract_json_between_xml_delimiters(text, tag):
    # Define the pattern for the given XML tag
    pattern = re.compile(rf'<{tag}>(.*?)</{tag}>', re.DOTALL)

    # Search for the pattern in the given text
    match = pattern.search(text)

    # If a match is found, extract the JSON string
    if match:
        json_string = match.group(1).strip()

        # Load the JSON string into a Python dictionary
        try:
            # Using json.loads to parse the JSON string into a dictionary
            json_object = json.loads(json_string)
            return json_object
        except json.JSONDecodeError as e:
            print(f"JSON decode error: {e}")
            return None

    # If no match is found, return None
    return None

response_text = response.text

# Extract narrator
story_narrator = extract_json_between_xml_delimiters(response_text, "story-narrator")
print("Narrator =", story_narrator)

# Extract voice actors
cast_for_voice_actors = extract_json_between_xml_delimiters(response_text, "cast-for-voice-actors")
print("cast_for_voice_actors =", cast_for_voice_actors)

# Extract dialogues
script_for_voice_actors = extract_json_between_xml_delimiters(response_text, "script-for-voice-actors")
print("script_for_voice_actors[:5] =", script_for_voice_actors[:5])

Narrator = {'present': 'true', 'is_actor_in_story': None}
cast_for_voice_actors = [{'id': 'bd81a33f-30e9-4cf0-a6a1-81580e70eb20', 'gender': 'male', 'primary_name': 'MR. BENNET', 'alt_names': ['Mr. Bennet']}, {'id': '8f2a7595-fa5f-4a71-925f-610f54967c35', 'gender': 'female', 'primary_name': 'Lizzy', 'alt_names': ['Elizabeth']}, {'id': 'e5a961a8-712f-4099-a76b-222a28e3087b', 'gender': 'female', 'primary_name': 'Mrs. Bennet', 'alt_names': None}, {'id': 'c27e3fb6-c00f-489c-9514-a49a092a859f', 'gender': 'female', 'primary_name': 'Kitty', 'alt_names': None}]
script_for_voice_actors[:5] = [{'actor': 'narrator', 'dialogue_content': {'story_author_text': 'MR. BENNET was among the earliest of those who waited on Mr. Bingley. He had always intended to visit him, though to the last always assuring his wife that he should not go; and till the evening after the visit was paid she had no knowledge of it. It was then disclosed in the following manner. Observing his second daughter employed in trimming
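
The schema in the system prompt contains trailing commas (for example after `alt_names`), and the earlier character-identification output echoed them. The response parsed above did not, but `json.loads` rejects trailing commas, so a defensive pre-cleaning step may be worth having. The sketch below is one assumption-laden way to do it; the helper name `strip_trailing_commas` is hypothetical, and the regex would also alter a matching comma-then-bracket sequence inside a string value, so it is a heuristic rather than a real parser.

```python
import json
import re

def strip_trailing_commas(json_like: str) -> str:
    """Remove commas that directly precede a closing brace or bracket."""
    return re.sub(r",(\s*[}\]])", r"\1", json_like)

# Hypothetical input that mimics the trailing commas in the prompt's schema.
raw = '{"id": "abc", "alt_names": ["Elizabeth",], }'
print(json.loads(strip_trailing_commas(raw)))
```

It could be applied to `json_string` inside `extract_json_between_xml_delimiters` just before the `json.loads` call.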

## Text To Speech

In [None]:
"""Synthesizes speech from the input string of text or ssml.
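Make sure to be working in a virtual environment.

Note: ssml must be well-formed according to:
    https://www.w3.org/TR/speech-synthesis/
"""

import google.cloud.texttospeech as tts

# Instantiate the client here, because this cell runs before the voice-assignment cell
client = tts.TextToSpeechClient()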

# Build the voice request, select the language code and the ssml voice gender
voice = tts.VoiceSelectionParams(
    language_code="en-US", ssml_gender=tts.SsmlVoiceGender.FEMALE
)

# Select the type of audio file you want returned
audio_config = tts.AudioConfig(
    audio_encoding=tts.AudioEncoding.MP3
)

def get_synthesized_speech(synthesis_input, file_name):
    print(synthesis_input)
    # Perform the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )

    # The response's audio_content is binary.
    with open(file_name, "wb") as out:
        # Write the response to the output file.
        out.write(response.audio_content)
        print('Audio content written to file')

# Set the text input to be synthesized
input_text = """Here are S S M L samples.
I can pause [3 second pause].
I can speak in cardinals. Your number is ten.
Or I can speak in ordinals. You are tenth in line.
Or I can even speak in digits. The digits for ten are one oh.
I can also substitute phrases, like the World Wide Web Consortium.
Finally, I can speak a paragraph with two sentences. This is sentence one. This is sentence two.
"""
synthesis_input_text = tts.SynthesisInput(text=input_text)

input_ssml = """
<speak>
  Here are <say-as interpret-as="characters">SSML</say-as> samples.
  I can pause <break time="3s"/>.
  I can speak in cardinals. Your number is <say-as interpret-as="cardinal">10</say-as>.
  Or I can speak in ordinals. You are <say-as interpret-as="ordinal">10</say-as> in line.
  Or I can even speak in digits. The digits for ten are <say-as interpret-as="characters">10</say-as>.
  I can also substitute phrases, like the <sub alias="World Wide Web Consortium">W3C</sub>.
  Finally, I can speak a paragraph with two sentences.
  <p><s>This is sentence one.</s><s>This is sentence two.</s></p>
</speak>
"""
synthesis_input_ssml = tts.SynthesisInput(ssml=input_ssml)

# The audio encoding selected above is MP3, so use a matching file extension
get_synthesized_speech(synthesis_input_text, "test-text.mp3")
get_synthesized_speech(synthesis_input_ssml, "test-ssml.mp3")


text: "Here are S S M L samples. \nI can pause [3 second pause].\nI can speak in cardinals. Your number is ten.\nOr I can speak in ordinals. You are tenth in line.\nOr I can even speak in digits. The digits for ten are one oh.\nI can also substitute phrases, like the World Wide Web Consortium.\nFinally, I can speak a paragraph with two sentences. This is sentence one. This is sentence two.\n"

Audio content written to file
ssml: "\n<speak>\n  Here are <say-as interpret-as=\"characters\">SSML</say-as> samples.\n  I can pause <break time=\"3s\"/>.\n  I can speak in cardinals. Your number is <say-as interpret-as=\"cardinal\">10</say-as>.\n  Or I can speak in ordinals. You are <say-as interpret-as=\"ordinal\">10</say-as> in line.\n  Or I can even speak in digits. The digits for ten are <say-as interpret-as=\"characters\">10</say-as>.\n  I can also substitute phrases, like the <sub alias=\"World Wide Web Consortium\">W3C</sub>.\n  Finally, I can speak a paragraph with two sentences.\n  <p><

In [None]:
## Assign voice to each acting character and the narrator

import google.cloud.texttospeech as tts

# Instantiates a client
client = tts.TextToSpeechClient()

# Voice names collected per gender across the preferred languages
available_male_voices = []
available_female_voices = []

# Fetch available voices for preferred languages
preferred_lang_codes = ['en-US', 'en-GB', 'en-AU', 'en-IN']
for language_code in preferred_lang_codes:
    voices_in_this_lang = client.list_voices(language_code=language_code).voices

    # Keep only Studio and Wavenet voices, to avoid voices that don't accept SSML input and would cause errors downstream
    studio_male_voices = [voice.name for voice in voices_in_this_lang if voice.ssml_gender == tts.SsmlVoiceGender.MALE and "Studio" in voice.name]
    available_male_voices += studio_male_voices
    wavenet_male_voices = [voice.name for voice in voices_in_this_lang if voice.ssml_gender == tts.SsmlVoiceGender.MALE and "Wavenet" in voice.name]
    available_male_voices += wavenet_male_voices

    studio_female_voices = [voice.name for voice in voices_in_this_lang if voice.ssml_gender == tts.SsmlVoiceGender.FEMALE and "Studio" in voice.name]
    available_female_voices += studio_female_voices
    wavenet_female_voices = [voice.name for voice in voices_in_this_lang if voice.ssml_gender == tts.SsmlVoiceGender.FEMALE and "Wavenet" in voice.name]
    available_female_voices += wavenet_female_voices

print("available_male_voices =", available_male_voices)
print("available_female_voices =", available_female_voices)

# Dictionary to store selected voices
selected_voices = {}

# List to store used voice IDs
used_voice_ids = []

# Function to get the first unused voice matching the gender
def get_voice(gender):
    # Any gender other than 'male' falls back to the female pool
    pool = available_male_voices if gender == 'male' else available_female_voices
    for voice_name in pool:
        if voice_name not in used_voice_ids:
            used_voice_ids.append(voice_name)
            return voice_name
    # No unused voice is left for this gender
    return None

# First select a cool voice for the narrator
narrator_voice_name = get_voice('male')
selected_voices['Narrator'] = narrator_voice_name

# Select voices for characters
for character in cast_for_voice_actors:
    character_name = character['id']
    gender = character['gender']
    voice_name = get_voice(gender)
    selected_voices[character_name] = voice_name

# Print selected voices
print("\nSelected Voices:")
for character_id, voice_name in selected_voices.items():
    print(f"{character_id}: {voice_name}")

available_male_voices = ['en-US-Studio-Q', 'en-US-Wavenet-A', 'en-US-Wavenet-B', 'en-US-Wavenet-D', 'en-US-Wavenet-I', 'en-US-Wavenet-J', 'en-GB-Studio-B', 'en-GB-Wavenet-B', 'en-GB-Wavenet-D', 'en-AU-Wavenet-B', 'en-AU-Wavenet-D', 'en-IN-Wavenet-B', 'en-IN-Wavenet-C']
available_female_voices = ['en-US-Studio-O', 'en-US-Wavenet-C', 'en-US-Wavenet-E', 'en-US-Wavenet-F', 'en-US-Wavenet-G', 'en-US-Wavenet-H', 'en-GB-Studio-C', 'en-GB-Wavenet-A', 'en-GB-Wavenet-C', 'en-GB-Wavenet-F', 'en-AU-Wavenet-A', 'en-AU-Wavenet-C', 'en-IN-Wavenet-A', 'en-IN-Wavenet-D']

Selected Voices:
Narrator: en-US-Studio-Q
bd81a33f-30e9-4cf0-a6a1-81580e70eb20: en-US-Wavenet-A
8f2a7595-fa5f-4a71-925f-610f54967c35: en-US-Studio-O
e5a961a8-712f-4099-a76b-222a28e3087b: en-US-Wavenet-C
c27e3fb6-c00f-489c-9514-a49a092a859f: en-US-Wavenet-E
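
`selected_voices` is keyed by character id, which is awkward to inspect by hand. A small lookup from any known name to the character id can be built from the `primary_name` and `alt_names` fields of the cast. This is only a sketch: the function name `build_alias_lookup` is hypothetical, and the ids in the sample are shortened stand-ins for the random ids shown above.

```python
def build_alias_lookup(cast):
    """Map every known name (lower-cased) of a character to that character's id."""
    lookup = {}
    for character in cast:
        # `alt_names` may be null in the model output, hence the `or []`
        names = [character["primary_name"]] + (character.get("alt_names") or [])
        for name in names:
            lookup[name.strip().lower()] = character["id"]
    return lookup

# Shortened, hypothetical ids; the structure mirrors `cast_for_voice_actors`.
sample_cast = [
    {"id": "bennet", "gender": "male", "primary_name": "MR. BENNET", "alt_names": ["Mr. Bennet"]},
    {"id": "lizzy", "gender": "female", "primary_name": "Lizzy", "alt_names": ["Elizabeth"]},
    {"id": "kitty", "gender": "female", "primary_name": "Kitty", "alt_names": None},
]
lookup = build_alias_lookup(sample_cast)
print(lookup["elizabeth"])
```

With the real cast, `selected_voices[lookup["elizabeth"]]` would then give the voice assigned to Lizzy.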


In [None]:
## Create speech audio for each dialogue and then stitch them all together to get a single audiobook

from pydub import AudioSegment

def synthesize_text(text, voice_name, speaking_rate=1.0):
    input_text = tts.SynthesisInput(text=text)
    voice = tts.VoiceSelectionParams(
        name=voice_name,
        language_code=voice_name[:5]
    )
    audio_config = tts.AudioConfig(
        audio_encoding=tts.AudioEncoding.MP3,
        speaking_rate=speaking_rate
    )
    response = client.synthesize_speech(
        input=input_text, voice=voice, audio_config=audio_config
    )
    return response.audio_content

# Create a list to store the audio segments
audio_segments = []

for idx, dialogue in enumerate(script_for_voice_actors):
    actor = dialogue['actor']

    if actor == "narrator":
        actor_name = 'Narrator'
    else:
        actor_name = actor['id']

    voice_name = selected_voices[actor_name]
    # Plain text is synthesized here; the SSML version lives under ['text_to_speech_markup']
    speech_text = dialogue['dialogue_content']['actor_speech_text']

    # Synthesize speech
    audio_content = synthesize_text(speech_text, voice_name)

    # Save audio content to a temporary file
    temp_file_name = f'text_dialog_{idx+1}.mp3'
    with open(temp_file_name, 'wb') as out:
        out.write(audio_content)
        print('Audio content written to file')

    # Load the audio file into an AudioSegment
    audio_segment = AudioSegment.from_mp3(temp_file_name)

    # Append to the list of audio segments
    audio_segments.append(audio_segment)

    # Add a break between segments
    audio_segments.append(AudioSegment.silent(duration=100))

# Combine all audio segments into one final audio file
final_audio = sum(audio_segments)

# Export the final audio file
final_audio.export("text-dialog-audiobook.mp3", format="mp3")

print("Audiobook creation complete.")

Audio content written to file
Audio content written to file
Audio content written to file
Audio content written to file
Audio content written to file
Audio content written to file
Audio content written to file
Audio content written to file
Audio content written to file
Audio content written to file
Audio content written to file
Audio content written to file
Audiobook creation complete.
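
The run above synthesizes the plain `actor_speech_text`. To use `text_to_speech_markup` instead, note that the prompt's own example produces several `<speak>...</speak>` elements back to back in one string, which is not a single well-formed SSML document. One possible approach, sketched here with a hypothetical helper `split_speak_documents`, is to split the string into individual `<speak>` documents and synthesize each one separately.

```python
import re

# Non-greedy match so that adjacent <speak> elements are returned one by one.
SPEAK_PATTERN = re.compile(r"<speak>.*?</speak>", re.DOTALL)

def split_speak_documents(markup: str):
    """Return each <speak>...</speak> element in `markup` as its own string."""
    return SPEAK_PATTERN.findall(markup)

# The example output from the system prompt above.
markup = (
    '<speak><prosody rate="85%">I feel lost,</prosody></speak>'
    '<speak><break time="200ms"/><prosody rate="90%" pitch="+5%">'
    "but I can't go back now.</prosody></speak>"
)
for ssml in split_speak_documents(markup):
    print(ssml)
```

Each resulting string could then be passed as `tts.SynthesisInput(ssml=ssml)`, as in the SSML test cell earlier. Whether a given voice accepts every tag the model emits is not verified here.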
