# Comparison of TTS engines

We tried two different TTS (text-to-speech) engines:

## pyttsx3
It works offline and uses the voices installed on your local system for TTS. The voices do not sound completely natural.

`pip install pyttsx3`

[GitHub](https://github.com/nateshmbhat/pyttsx3)

## gTTS
It uses Google TTS and produces very natural voices. The audio can be saved directly to a file, with no limit on text length. It needs an internet connection.

`pip install gtts`

[GitHub](https://github.com/pndurette/gTTS)

All examples were run with Python 3.11.3.

Neither engine supports [SSML](https://www.w3.org/TR/speech-synthesis/), so we need to fake the pauses that markup like this would normally produce:
`<speak>Let's wait for a moment. <break time="3s"/>. And we are back</speak>`
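
One possible way to fake such pauses is to split the text at each `<break>` tag and keep the requested duration, so that silence can be inserted between the synthesized pieces. The sketch below is only an illustration: `split_on_breaks` and `BREAK_RE` are hypothetical names, not part of either library, and it assumes only `<break time="..."/>` tags with `s` or `ms` units need handling.

```python
import re

# Matches SSML-style breaks such as <break time="3s"/> or <break time='500ms'/>.
BREAK_RE = re.compile(r"<break\s+time=[\"'](\d+(?:\.\d+)?)(ms|s)[\"']\s*/>")


def split_on_breaks(ssml: str) -> list[tuple[str, int]]:
    """Return (text, pause_ms) pairs; pause_ms is the silence to add after the text."""
    # Drop the <speak> wrapper; only <break> tags are handled in this sketch.
    body = re.sub(r"</?speak>", "", ssml)
    # With two capture groups, split() yields: text, value, unit, text, value, unit, ..., text
    parts = BREAK_RE.split(body)
    segments = []
    for i in range(0, len(parts), 3):
        text = parts[i].strip()
        if i + 2 < len(parts):
            value, unit = float(parts[i + 1]), parts[i + 2]
            pause_ms = int(value * 1000) if unit == "s" else int(value)
        else:
            pause_ms = 0  # No break follows the final piece of text
        if text or pause_ms:
            segments.append((text, pause_ms))
    return segments


print(split_on_breaks("<speak>A red house.<break time='5s'/> A big dog.</speak>"))
```

Each text could then be passed to the TTS engine, and each non-zero pause rendered as silence, for example with pydub's `AudioSegment.silent(duration=pause_ms)`.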

For MP3 processing, we used [pydub](https://github.com/jiaaro/pydub):

`pip install pydub`

pydub requires [FFmpeg](https://www.ffmpeg.org/). On Windows, for example, the easiest way to install it is to run `winget install "FFmpeg (Essentials Build)"`.
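
Before exporting MP3 files, it can help to check that FFmpeg is actually reachable. The snippet below is a rough sketch, assuming that an `ffmpeg` executable on `PATH` is what matters; `ffmpeg_available` is a hypothetical helper name, and a `PATH` lookup is only an approximation of how pydub locates its converter.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("FFmpeg not found on PATH; MP3 export with pydub will probably fail.")
```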

# Further reading
- https://www.geeksforgeeks.org/convert-text-speech-python/
- https://pyttsx3.readthedocs.io/en/latest/engine.html#examples
- https://gtts.readthedocs.io/en/latest/module.html#localized-accents
- https://gtts.readthedocs.io/en/latest/module.html#examples

# Other options to explore
I wanted to use TikTok voices; this could be an interesting option: https://rapidapi.com/dalamates/api/tts-tiktok

Note about `pyttsx3` inside a Jupyter notebook:
Most of the code examples you find online call `engine.runAndWait()`. This works fine repeatedly in a Python editor such as IDLE or Visual Studio Code. However, in a Jupyter notebook, running the same cell twice raises an error: `RuntimeError: run loop already started`.

This is because a Jupyter notebook also runs an async event loop in the background, which conflicts with `engine.runAndWait()`, since that call expects to control its own loop.

The easiest solution is to restart the kernel (or just restart VS Code). For simple tests, this works well.
Note that I also tried iterating the engine and using threads, both without success.
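
Another workaround that might be worth trying is to run `pyttsx3` in a separate Python process, so that its run loop ends when that process exits. This is an untested assumption rather than something verified in this notebook; `CHILD_SCRIPT`, `build_speak_command` and `speak_in_subprocess` are hypothetical names introduced only for this sketch.

```python
import subprocess
import sys

# Script executed in the child process; sentences arrive as command-line arguments.
CHILD_SCRIPT = (
    "import sys, pyttsx3\n"
    "engine = pyttsx3.init()\n"
    "for sentence in sys.argv[1:]:\n"
    "    engine.say(sentence)\n"
    "engine.runAndWait()\n"
)


def build_speak_command(sentences: list[str]) -> list[str]:
    """Build the argv for a fresh Python interpreter that speaks the sentences."""
    return [sys.executable, "-c", CHILD_SCRIPT, *sentences]


def speak_in_subprocess(sentences: list[str]) -> None:
    """Run pyttsx3 in a separate interpreter so nothing is left running in the kernel."""
    subprocess.run(build_speak_command(sentences), check=True)


# Example (needs pyttsx3 installed and an audio device):
# speak_in_subprocess(["The red house.", "A big dog."])
```

Starting a new interpreter for every call is slow, so this would only suit short tests like the ones in this notebook.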


In [1]:
import pyttsx3

engine = pyttsx3.init()

engine.setProperty("rate", 100)
engine.setProperty("volume", 1.0)

engine.say("The red house.")
engine.say("A big dog.")

engine.runAndWait()


In [2]:
# Different voices
import pyttsx3

engine = pyttsx3.init()

voices = engine.getProperty("voices")

for voice in voices:
  print(voice)



On my computer I get these results:
```ini
<Voice id=HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens\TTS_MS_EN-GB_HAZEL_11.0
          name=Microsoft Hazel Desktop - English (Great Britain)
          languages=[]
          gender=None
          age=None>
<Voice id=HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens\TTS_MS_EN-US_ZIRA_11.0
          name=Microsoft Zira Desktop - English (United States)
          languages=[]
          gender=None
          age=None>
<Voice id=HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens\TTS_MS_ES-ES_HELENA_11.0
          name=Microsoft Helena Desktop - Spanish (Spain)
          languages=[]
          gender=None
          age=None>
```

In [1]:
import pyttsx3

engine = pyttsx3.init()

engine.setProperty("rate", 100)
engine.setProperty("volume", 1.0)

voices = engine.getProperty("voices")

for voice in voices:
  print(f"Voice: {voice.name}")
  engine.setProperty('voice', voice.id)
  engine.say("The red house.")
  engine.say("A big dog.")

engine.runAndWait()

Voice: Microsoft Hazel Desktop - English (Great Britain)
Voice: Microsoft Zira Desktop - English (United States)
Voice: Microsoft Helena Desktop - Spanish (Spain)


In [1]:
# Since pyttsx3 doesn't support SSML, we try adding extra individual dots (or ellipses)
# Using dots or ellipses makes no difference to the length of the pause
import pyttsx3

engine = pyttsx3.init()

engine.setProperty("rate", 100)
engine.setProperty("volume", 1.0)

engine.say("The red house....................")
engine.say("A big dog.")

engine.say("The white house … … … …")
engine.say("A small dog.")

engine.runAndWait()


In [1]:
# This worked the first time but not the second, i.e. it is useless for running inside a Jupyter notebook multiple times
import threading
import pyttsx3

def speak_sentences(sentences: list[str]) -> None:
  engine = pyttsx3.init()
  for sentence in sentences:
    engine.say(sentence)
  engine.runAndWait()

sentences = ["The red house.", "A big dog."]

thread = threading.Thread(target=speak_sentences, args=(sentences,))
thread.start()
thread.join() 

In [10]:
import pyttsx3
from pydub import AudioSegment
import tempfile

sentences = [
  "The red house.",
  "A big dog.",
  "The white house",
  "A small dog"
]

engine = pyttsx3.init()
engine.setProperty("rate", 100)
engine.setProperty("volume", 1.0)

audio_segments = []

for sentence in sentences:
  # Use delete=False; otherwise the temp file is deleted before runAndWait() gets to process it
  with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_audio:
    temp_filename = temp_audio.name
    engine.save_to_file(sentence, temp_filename)  # Save speech to a temp WAV file
    print(f"Saved to {temp_audio.name}")
  engine.runAndWait()

  speech_audio = AudioSegment.from_file(temp_audio.name, format="wav")
  audio_segments.append(speech_audio)

  # Add a 5-second silent pause
  silent_segment = AudioSegment.silent(duration=5000)
  audio_segments.append(silent_segment)

final_audio = sum(audio_segments, AudioSegment.silent(duration=0))

final_audio.export("test_4_sentences_with_pauses.mp3", format="mp3")

Saved to C:\Users\Oliver\AppData\Local\Temp\tmp8tf7s7_y.wav
Saved to C:\Users\Oliver\AppData\Local\Temp\tmppt_k_0bl.wav
Saved to C:\Users\Oliver\AppData\Local\Temp\tmpm9sm1wny.wav
Saved to C:\Users\Oliver\AppData\Local\Temp\tmp1j3l_ljt.wav


<_io.BufferedRandom name='test_4_sentences_with_pauses.mp3'>

In [2]:
from gtts import gTTS
import os

text = "The red house. A big dog."
tts = gTTS(text=text, lang="en", tld="us", slow=True) # American accent

tts.save("test.mp3")

os.system("start test.mp3")  # Play in Windows

0

In [5]:
from gtts import gTTS
import os

text = "A red house. A big dog."
tts = gTTS(text=text, lang="en", tld="co.uk", slow=True) # British accent

tts.save("test.mp3")

os.system("start test.mp3")  # Play in Windows

0

In [4]:
# Check if gTTS supports SSML (it doesn't)
from gtts import gTTS
import os

text = "<speak>A red house.<break time='5s'/> A big dog.</speak>"
tts = gTTS(text=text, lang="en", tld="co.uk", slow=True) # British accent

tts.save("test.mp3")

os.system("start test.mp3")  # Play in Windows

0